[MRG + 1] enable metric = 'cosine' for tsne computation #9623

Merged
merged 10 commits into scikit-learn:master on Oct 2, 2017

@oliblum90
Contributor

oliblum90 commented Aug 24, 2017

Set neighbors_method to "brute" for metric='cosine', since with neighbors_method="ball_tree" the NearestNeighbors estimator raises the error:

ValueError: Metric 'cosine' not valid for algorithm 'ball_tree'

The following code snippet works for me in version 0.18.1, but no longer works in version 0.19. With the change in this PR, the snippet works again.

from sklearn.manifold import TSNE
import numpy as np

z = np.random.rand(1000*256).reshape((1000, 256))

tsne = TSNE(n_components=2,
            random_state=0,
            metric='cosine',
            learning_rate=1000)

tsne.fit_transform(z)
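
For context, the regression can be reproduced with NearestNeighbors directly. This is a minimal sketch (the data shape is arbitrary) showing that the ball_tree algorithm rejects the cosine metric while the brute-force search accepts it:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(100, 16)

# BallTree does not implement the cosine metric, so fitting raises
# "ValueError: Metric 'cosine' not valid for algorithm 'ball_tree'"
try:
    NearestNeighbors(n_neighbors=5, algorithm='ball_tree', metric='cosine').fit(X)
except ValueError as exc:
    print(exc)

# the brute-force search supports the cosine metric
NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(X)
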
enable metric = 'cosine' for tsne computation

@jnothman jnothman added this to the 0.19.1 milestone Aug 27, 2017

@jnothman jnothman added the Bug label Aug 27, 2017

@jnothman
Member

jnothman commented Aug 27, 2017

Thanks for reporting the regression.

@jnothman

This needs a test.

(review comment on sklearn/manifold/t_sne.py, now outdated)

set NearestNeighbors algorithm to 'auto'
When setting NearestNeighbors algorithm='auto', the optimal algorithm for each case (metric) is chosen automatically, so it makes no sense to manually distinguish between cases.
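
A quick way to see what 'auto' does for this metric (a hedged illustration; _fit_method is a private attribute and is shown here only for inspection):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(100, 16)

# 'auto' picks an algorithm that supports the requested metric; cosine is not
# available in KDTree/BallTree, so the brute-force search is selected
nn = NearestNeighbors(n_neighbors=5, algorithm='auto', metric='cosine').fit(X)
print(nn._fit_method)   # expected: 'brute'
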
@jnothman
Member

jnothman commented Aug 28, 2017

Please add a non-regression test.

@oliblum90
Contributor

oliblum90 commented Aug 28, 2017

As a non-regression test I would suggest testing t-SNE with several metrics.

I found that the t-SNE tests mostly generate random points in a high-dimensional space and compute an embedding. Afterwards the trustworthiness of the embedding is computed with sklearn.manifold.t_sne.trustworthiness.

There are already several tests for metric='precomputed':

  • test_preserve_trustworthiness_approximately_with_precomputed_distances
  • test_non_square_precomputed_distances
  • test_non_positive_precomputed_distances

So I would suggest adding a test for the metrics ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. However, the method sklearn.manifold.t_sne.trustworthiness does not support different metrics. This could actually be done easily by exchanging lines 424 and 425 in the file sklearn/manifold/t_sne.py

    dist_X = pairwise_distances(X, squared=True)
    dist_X_embedded = pairwise_distances(X_embedded, squared=True)

with the lines

    dist_X = pairwise_distances(X, squared=True, metric=metric)
    dist_X_embedded = pairwise_distances(X_embedded, squared=True, metric=metric)

but changing this would actually belong in another branch... What do you think?

@tomMoral
Contributor

tomMoral commented Aug 28, 2017

I fixed the nearest-neighbors method to ball_tree because the previous implementation was doing so: it used the BallTree implementation directly. We discussed setting the method to 'auto', but wanted to do some benchmarking first, I think.

As a side note, you'd better be really careful with changing the metric. In some sense, if you change the metric you no longer have the "t-distributed" part of t-SNE, so I am not sure how we cope with this. All the code assumes euclidean distances in both the input and output space. The output-space distance should not be changed, so dist_X_embedded = pairwise_distances(X_embedded, squared=True) should not use the parameter metric.
For the input space, I am not sure what changing the metric implies. At the very least, the squared=True option might not be the right parameter; for instance, using a squared l1 metric does not seem like the right way to go.
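
For reference, the output-space similarities in the t-SNE paper (van der Maaten & Hinton, 2008) use a Student-t kernel over euclidean distances, which is where the "t-distributed" part comes from:

q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}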

@oliblum90
Contributor

oliblum90 commented Aug 28, 2017

Oh yes! I also did not think about the output space.

What do you suggest now? Shall I make a new commit as suggested before, just without specifying the metric for the calculation of dist_X_embedded?

@tomMoral
Contributor

tomMoral commented Aug 29, 2017

The metric used in the output space is the euclidean distance. This is hard-coded, at least in the barnes_hut implementation, and it is in accordance with the t-SNE paper. It also makes sense, as the euclidean distance is the distance we understand best, so it is well suited for visualization.

For the input space, the metric used in the original paper is also the euclidean distance. I think changing it should be done with care. All the code has been written assuming that the metric used was the euclidean distance; we notably tried to optimize the squared / not-squared computation. So if the metric is changed, we should check that the computed probabilities pij actually make sense.

@oliblum90 if you have time to check the formula for pij, I can review it. If you don't want to do it, let me know and I will try to do it.
EDIT: this is the paper I am referring to for the t-SNE implementation. The pij formula is at the beginning of section 2.
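
For reference, the input-space probabilities from that section are defined with euclidean distances, so this is what needs re-checking if another metric is substituted:

p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}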

@tomMoral
Contributor

tomMoral commented Aug 29, 2017

OK, I just checked and the current implementation is correct: it only squares the distances when we use the euclidean distance.

I would probably go for (c): document more clearly the behavior when changing the metric.

@amueller
Member

amueller commented Aug 30, 2017

The question of whether we want different metrics is a bit separate from this regression, though, right?
If we want to remove metric we need a deprecation cycle; here we just broke behavior.

@tomMoral you haven't checked that all the code makes sense for different metrics, though? I guess the metric parameter was part of the initial implementation, so I would guess that everything takes it into account?

@tomMoral
Contributor

tomMoral commented Aug 30, 2017

Yes. The behavior is the same as in 0.18 with this PR.

My question is just whether or not the distances should be squared if we want the method to make sense with other metrics. But that is probably a topic for another PR.

@amueller
Member

amueller commented Aug 30, 2017

lgtm as a regression fix

@amueller amueller changed the title from enable metric = 'cosine' for tsne computation to [MRG + 1] enable metric = 'cosine' for tsne computation Aug 30, 2017

@jnothman
Member

jnothman commented Aug 30, 2017

Before merge, could we please have a test that TSNE(metric='manhattan', random_state=0).fit_transform(X) is equivalent to TSNE(metric='precomputed', random_state=0).fit_transform(manhattan_distances(X))?
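
A minimal sketch of the kind of test being requested (the test name and tolerance are illustrative, not necessarily what was eventually merged):

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import manhattan_distances


def test_tsne_manhattan_matches_precomputed():
    rng = np.random.RandomState(0)
    X = rng.randn(50, 10)

    # embedding with the metric passed directly to TSNE
    X_metric = TSNE(metric='manhattan', random_state=0).fit_transform(X)

    # embedding from the same distances passed as a precomputed matrix
    X_precomputed = TSNE(metric='precomputed', random_state=0).fit_transform(
        manhattan_distances(X))

    assert_array_almost_equal(X_metric, X_precomputed)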

@rth rth referenced this pull request Sep 5, 2017

Open

T-SNE fails for CSR matrix #9691

@jnothman
Member

jnothman commented Sep 5, 2017

@oliblum90, we would like to release 0.19.1 with this regression fixed soon. Could you please complete this or let us know if we should find another contributor. Thanks.

@jnothman
Member

jnothman commented Sep 5, 2017

@tomMoral I also suspect we need to disallow init == 'pca' for precomputed and probably other metrics.

@tomMoral
Contributor

tomMoral commented Sep 6, 2017

@jnothman init='pca' is disallowed with metric='precomputed'. For other metrics, I don't know. It seems useful to use init='pca' even in these cases, as it gives a first structure to the output space (better than random), which is computed in euclidean space. But it is true that it makes the assumption of a euclidean norm in the output space. This should probably be benchmarked.

I will open an issue for the metric change in t-SNE, to discuss this specific question and the question of squaring the distance.

@jnothman
Member

jnothman commented Sep 10, 2017

@oliblum90, we'd like to release 0.19.1 soon. Could you please add a regression test for cosine at least, or pass the buck? Thanks.

@oliblum90
Contributor

oliblum90 commented Sep 11, 2017

OK, I'll try to get it done today or tomorrow.

@oliblum90
Contributor

oliblum90 commented Sep 11, 2017

I accidentally made a new pull request (#9732) for the test I added. Shall I leave it like that? Otherwise, please explain to me how I can add the test to the current pull request with the GitHub online interface. Sorry for the trouble.

@amueller
Member

amueller commented Sep 11, 2017

@oliblum90 I don't think you can do that with the online interface, but you can just push to the same branch again (patch-5).

@amueller
Member

amueller commented Sep 11, 2017

Something like:

git checkout patch-5
git merge patch-6
git push origin patch-5
created test for tsne with different distance metrics
For a set of points, the pairwise distances are computed with a given distance metric. The distance array is used as a precomputed distance matrix to compute a t-SNE embedding. Afterwards another t-SNE embedding is computed directly using the same distance metric. At the end of the test it is ensured that the two computed t-SNE embeddings correspond to each other.
@oliblum90
Contributor

oliblum90 commented Sep 11, 2017

ok, done. Thank you.

@jnothman
Member

jnothman commented Sep 11, 2017

You should be able to just add, commit and push to make changes.

@oliblum90
Contributor

oliblum90 commented Sep 12, 2017

Hm... now I don't get why it fails. Does anyone have an idea?

FAIL: sklearn.manifold.tests.test_t_sne.test_uniform_grid('barnes_hut',)

Traceback (most recent call last):
File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
self.test(*self.arg)
File "C:\Python27\lib\site-packages\sklearn\manifold\tests\test_t_sne.py", line 741, in check_uniform_grid
assert_less(largest_to_mean, 2, msg=try_name)
AssertionError: barnes_hut_1

@jnothman

I'm on the move atm, but I'm surprised at the test failing on Windows (AppVeyor) and not on Linux (Travis). Heisenbug, or something caused by this PR?

@jnothman
Member

jnothman commented Sep 12, 2017

No, it's not a heisenbug. It's happened consistently in this PR.

@jnothman
Member

jnothman commented Sep 12, 2017

So the source of the error is that:

  • NearestNeighbors is choosing a kd_tree rather than a ball_tree because it's a small dataset
  • because the data is on a grid, nearest neighbors are often tied, and KDTree and BallTree return results in different orders
  • apparently KDTree isn't stable across the Windows-Linux divide, at least where ties are involved (I suppose this isn't surprising)
  • apparently this test is brittle to such variation :( ping @tomMoral

A quick fix is to still hard-code ball_tree for euclidean.
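
A minimal sketch of that quick fix (the helper name is made up; in the actual code the choice happens inside TSNE's fit, and the 0.19 code hard-coded ball_tree):

from sklearn.neighbors import NearestNeighbors


def make_knn(metric, n_neighbors):
    # keep ball_tree for the default euclidean metric, and let 'auto' pick a
    # compatible algorithm for every other metric
    algorithm = 'ball_tree' if metric == 'euclidean' else 'auto'
    return NearestNeighbors(algorithm=algorithm, n_neighbors=n_neighbors,
                            metric=metric)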

@tomMoral
Contributor

tomMoral commented Sep 12, 2017

We knew this test was brittle but not why... now it makes sense: trustworthiness is computed using the neighbors and might break because of these inconsistencies.

I will try to come up with a more robust test this afternoon.

@oliblum90
Contributor

oliblum90 commented Sep 12, 2017

@tomMoral: would I then have to rebase on your branch?

@jnothman
Member

jnothman commented Sep 12, 2017

In this case the brittleness is in the bhtsne use of NN. Can we use deterministic random perturbation to make the test more stable?

@tomMoral
Contributor

tomMoral commented Sep 12, 2017

Yes, it might be a good idea to either:

  • Add a deterministic random perturbation between lines 719 and 720: add
    X_2d_grid += 1e-5 * check_random_state(seed).normal(size=X_2d_grid.shape) to make the NN queries deterministic.
  • Increase the perplexity parameter to reduce this effect. The number of neighbors used is k = 3 * perplexity + 1, so using perplexity=13 should select all ties (on the uniform grid, there are ties for 4 points at a time).

I think the first option is the cleaner one; see the sketch after this comment.

As for the brittleness of the trustworthiness, I thought it could be the same problem for the failure in test_preserve_trustworthiness, but after checking, we do not use NN in the trustworthiness computation.
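
A minimal sketch of the first option (the grid construction here only approximates the existing test_uniform_grid fixture):

import numpy as np
from sklearn.utils import check_random_state

seed = 0
size = 10

# small uniform 2D grid of points, similar in spirit to the test fixture
xx, yy = np.meshgrid(np.arange(size, dtype=float), np.arange(size, dtype=float))
X_2d_grid = np.hstack([xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)])

# tiny deterministic perturbation: it breaks the exact distance ties, so KDTree and
# BallTree return neighbors in the same, reproducible order on every platform
X_2d_grid += 1e-5 * check_random_state(seed).normal(size=X_2d_grid.shape)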

@jnothman
Member

jnothman commented Sep 12, 2017

  1. Do you want to add the perturbation, @tomMoral, or should @oliblum90 have a go?
  2. Should we still special-case Euclidean to use ball tree?
@tomMoral
Contributor

tomMoral commented Sep 14, 2017

  1. I made the commit, but I cannot push it to your branch, @oliblum90. Can you give me write access on your fork, so I can push to this branch?

  2. I would not special-case the euclidean norm. If KDTree is more efficient for low-dimensional data, it is worth using it, as there are not many cases with as many ties as the uniform grid.

@oliblum90
Contributor

oliblum90 commented Sep 14, 2017

OK, I added you as a collaborator on GitHub. Did it work out?

@jnothman
Member

jnothman commented Sep 15, 2017

No such luck @tomMoral

@tomMoral
Contributor

tomMoral commented Sep 15, 2017

This time, it is the "exact" implementation failing.
There is a discrepancy between Linux and Windows. The exact implementation is pure Python/NumPy; the main difference is the RNG, so I guess this test is not stable.
I locally increased the number of seeds tested and got consistent failures for this level of perplexity and 10 seeds.
With perplexity=30 there are no failures for the first 10 seeds, but with 100 seeds it starts failing too, both for exact and bh.

Inherently, some initializations might be bad and t-SNE is not able to push some points over the potential barrier, so there is always a situation where this test can fail (statistically). A possible solution to get a more robust test is to re-run the "bad initializations". I propose that when this test fails, we re-run t-SNE from Y.

I tried this strategy for 200 random initializations with perplexity=20 and only got 10 re-runs for barnes_hut and 9 for exact. When I checked the failures, the grid was in two parts and t-SNE was not able to unfold it. I think it is a good solution, but what is your take, @jnothman?

Examples of failures:

barnes_hut_10
exact_25
exact_88
exact_145
barnes_hut_12

@jnothman

Is this retrying the same as doubling the number of iterations? Or is it very different?

And with a single/shorter run we had numerical instability issues?

Mostly we need the test to be a meaningful measure of correctness, but it's much preferable if it tests the same way on all platforms. Is there a chance we'd get more stability simply by changing the scale of the input grid or something like that?

@tomMoral
Contributor

tomMoral commented Sep 17, 2017

The proposed solution is not the same as doubling the number of iterations, because there is a second early_exaggeration step.

The main issue is when it converges to a bad local minimum because the initialization is not good enough. With some initializations there is a potential barrier that prevents the correct arrangement of the points, and we end up with solutions like the ones displayed above. Re-running from this local minimum with early_exaggeration makes it possible to break through the potential barrier.

I did not try scaling the grid, but I don't think it would help make this more cross-platform.
Is the random initialization the same on all platforms? If so, the main issue is that the code is not consistent, as some platforms fail the test with the same initialization.
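
A minimal sketch of the retry idea (assuming the embedding of the failed run is available; TSNE accepts an array for init):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).rand(100, 5)

tsne = TSNE(n_components=2, random_state=0)
Y_prev = tsne.fit_transform(X)

# if the first run ended in a bad local minimum, restart from its solution so the
# early_exaggeration phase gets another chance to push points over the barrier
Y_retry = TSNE(n_components=2, init=Y_prev, random_state=0).fit_transform(X)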

@jnothman
Member

jnothman commented Sep 17, 2017

I'm pretty sure numpy.random should be cross-platform.

@jnothman
Member

jnothman commented Sep 18, 2017

Well, I'm not entirely persuaded by that test. I think it may deserve a comment that the distance ties made BH t-SNE platform-dependent due to numerical imprecision (and perhaps that the perturbation to avoid ties led to failures in exact), if I've got that correctly.

Otherwise, I suppose we should merge this and move on to the rest of 0.19.1.

@tomMoral
Contributor

tomMoral commented Sep 18, 2017

I agree with you, I am not convinced by the test. The non-convexity makes the impact of numerical imprecision hard to evaluate, but I do not see an easy way to fix it.

The re-run has the advantage of ensuring that we can converge to a grid-like solution.

Is the comment okay like that?

@tomMoral

Could you address this comment? Apart from that, this PR seems okay to me; the uniform_grid issue is not related and should be handled separately.

(review comment on sklearn/manifold/tests/test_t_sne.py, now outdated)
@jnothman
Member

jnothman commented Sep 25, 2017

I think we should get 0.19.1 out in the next week or so. Another review for merge? @amueller, @lesteve?

Some background: when changing to use NearestNeighbors(algorithm='auto') we found that the uniform grid test failed on some platforms due to KDTree returning neighborhoods in a different order than BallTree. @tomMoral argues that the test is brittle to bad initialisations and so has patched it with a retry mechanism, which does not have test coverage on Travis, but helps the tests to pass overall.

@tomMoral
Contributor

tomMoral commented Sep 25, 2017

I proposed the retry mechanism based on the observation that when we increase the number of seeds used in the test, we end up with failures without it. On certain platforms, the method might not be consistent (because of the tie handling in KDTree, if I understood correctly), and the retry mechanism should make sure that this does not cause failures.

My point for the retry mechanism: the optimization is not convex and some configurations might end up in bad local minima. Retrying makes it possible to escape these local minima, because we restart from the final point of the previous try and get some early-exaggeration steps that escape the minima.

@lesteve

LGTM even if I have to admit I don't understand the details of TSNE.

Should we have a warning if the metric is not euclidean, because we are not sure that it is even a good idea to do that, according to #9623 (comment)?

Also a minor nitpick while I was at it.

(review comment on sklearn/manifold/tests/test_t_sne.py, now outdated)
@oliblum90
Contributor

oliblum90 commented Sep 28, 2017

To visualize neural network features, cosine similarity is often the best choice for t-SNE (see here). Neural networks are becoming more and more important, so I think it is pretty important to support this distance metric.

@tomMoral
Contributor

tomMoral commented Oct 2, 2017

I think it is the right solution. This fixes the regression and we can discuss in #9695 about the proper support of other metrics.

@lesteve
Member

lesteve commented Oct 2, 2017

Thanks @tomMoral, I pushed the fix for the nitpick I had and will merge when this is green.

@lesteve
Member

lesteve commented Oct 2, 2017

Merging, thanks a lot @oliblum90 and @tomMoral!

@lesteve lesteve merged commit 1701fcf into scikit-learn:master Oct 2, 2017

5 of 6 checks passed

codecov/patch: 89.28% of diff hit (target 96.16%)
ci/circleci: Your tests passed on CircleCI!
codecov/project: 96.16% (+<.01%) compared to d8c363f
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
lgtm analysis: Python: No alert changes

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Oct 3, 2017

@lesteve lesteve referenced this pull request Oct 3, 2017

Merged

Release of version 0.19.1 #9607

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
