
[MRG + 1] enable metric = 'cosine' for tsne computation #9623

Merged
merged 10 commits into scikit-learn:master on Oct 2, 2017

Conversation

@oliblum90 (Contributor) commented Aug 24, 2017

Set neighbors_method to "brute" for metric='cosine', because with metric="cosine" and neighbors_method="ball_tree" the NearestNeighbors method raises the error:

ValueError: Metric 'cosine' not valid for algorithm 'ball_tree'

The following code snippet works for me in version 0.18.1 but no longer works in version 0.19. With the change I added, the snippet works again.

from sklearn.manifold import TSNE
import numpy as np

# 1000 random points in a 256-dimensional space
z = np.random.rand(1000, 256)

tsne = TSNE(n_components=2,
            random_state=0,
            metric='cosine',
            learning_rate=1000)

tsne.fit_transform(z)
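
For context, the underlying constraint can be reproduced with NearestNeighbors directly; this is a hypothetical illustration of where the error comes from, not code from the PR:

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.random.rand(10, 4)

# BallTree does not support the cosine metric, so fitting raises
# ValueError: Metric 'cosine' not valid for algorithm 'ball_tree'
nn = NearestNeighbors(algorithm='ball_tree', metric='cosine')
nn.fit(X)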
@jnothman jnothman added this to the 0.19.1 milestone Aug 27, 2017
@jnothman jnothman added the Bug label Aug 27, 2017
@jnothman (Member) commented Aug 27, 2017

Thanks for reporting the regression.

@jnothman (Member) left a comment

This needs a test.

@@ -712,7 +712,7 @@ def _fit(self, X, skip_num_points=0):

         # Find the nearest neighbors for every point
         neighbors_method = 'ball_tree'
-        if (self.metric == 'precomputed'):
+        if (self.metric == 'precomputed') or (self.metric == "cosine"):

@jnothman (Member) commented Aug 27, 2017

I don't get why this logic is here at all. I think we should always be using neighbors_method='auto', which will use ball_tree when it can. Ping @tomMoral (who should have his name in the tsne source code)?

@oliblum90 (Author, Contributor) commented Aug 27, 2017

I agree. Shall I make a new pull request?

@jnothman (Member) commented Aug 27, 2017
When setting NearestNeighbors algorithm='auto', the optimal algorithm for each metric is chosen automatically, so it makes no sense to manually distinguish between cases.
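
For illustration, a small sketch (assumed, not from the PR) of why 'auto' sidesteps the problem: it falls back to brute force for metrics that the tree structures do not support:

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.random.rand(100, 16)

# With algorithm='auto', the cosine metric silently selects brute force,
# so no ValueError is raised and the neighbors are still correct.
nn = NearestNeighbors(algorithm='auto', metric='cosine').fit(X)
distances, indices = nn.kneighbors(X, n_neighbors=5)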
@jnothman (Member) commented Aug 28, 2017

Please add a non-regression test.

@oliblum90 (Author, Contributor) commented Aug 28, 2017

As a non-regression test, I would suggest testing t-SNE with several metrics.

I found that the t-SNE tests mostly generate random points in a high-dimensional space and compute an embedding. Afterwards, the trustworthiness of the embedding is computed with sklearn.manifold.t_sne.trustworthiness.

There already are several tests for metric='precomputed'

  • test_preserve_trustworthiness_approximately_with_precomputed_distances
  • test_non_square_precomputed_distances
  • test_non_positive_precomputed_distances

So I would suggest adding a test for the metrics ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. However, the method sklearn.manifold.t_sne.trustworthiness does not support different metrics. This could easily be fixed by exchanging lines 424 and 425 in the file sklearn/manifold/t_sne.py

dist_X = pairwise_distances(X, squared=True)
dist_X_embedded = pairwise_distances(X_embedded, squared=True)

with the lines

dist_X = pairwise_distances(X, squared=True, metric=metric)
dist_X_embedded = pairwise_distances(X_embedded, squared=True, metric=metric)

but changing this would actually belong in another branch...
What do you think?

@tomMoral (Contributor) commented Aug 28, 2017

I fixed the nearest-neighbors method to ball_tree because the previous implementation was doing so: it used the BallTree implementation directly. We discussed setting the method to automatic, but I think we wanted to do some benchmarking first.

As a side note, you had better be really careful with changing the metric. In some sense, if you change the metric, you do not have the "t-distributed" part of t-SNE, so I am not sure how we cope with this. All the code is valid for the use of Euclidean distances in both the input and output space. The output-space distance should not be changed, so dist_X_embedded = pairwise_distances(X_embedded, squared=True) should not use the metric parameter.
For the input space, I am not sure what changing the metric implies. At the very least, the squared=True option might not be the right parameter; for instance, using a squared l1 metric does not seem the right way to go.
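
To illustrate the point about squaring (a hypothetical snippet, not from the PR): squared=True is a shortcut specific to the Euclidean metric, where the affinities are computed from squared distances directly, and there is no analogous "squared l1" quantity:

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(5, 3)

# Squared Euclidean distances feed straight into the Gaussian kernel
# exp(-||x_i - x_j||^2 / (2 * sigma_i^2)), so squared=True is natural here.
d2 = pairwise_distances(X, metric='euclidean', squared=True)

# For l1 there is no squared=True fast path; squaring these distances
# element-wise would change the geometry rather than speed anything up.
d1 = pairwise_distances(X, metric='l1')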

@oliblum90 (Author, Contributor) commented Aug 28, 2017

Oh yes! I also did not think about the output space.

What do you suggest now? Shall I make a new commit as suggested before, just without specifying the metric for the calculation of dist_X_embedded?

@tomMoral (Contributor) commented Aug 29, 2017

The metric used in the output space is the Euclidean distance. This is hard-coded, at least in the barnes_hut implementation, and it is in accordance with the t-SNE paper. It also makes sense, as the Euclidean distance is the distance we understand best, so it is well suited for visualization.

For the input space, the metric used in the original paper is also the Euclidean distance. I think changing it should be done with care. All the code has been written assuming that the metric used was the Euclidean distance; we notably tried to optimize the squared/not-squared computation. So if the metric is changed, we should check that the computed probabilities p_ij actually make sense.

@oliblum90 if you have time to check the formula for p_ij, I can review it. If you don't want to do it, let me know and I will try to do it.
EDIT: this is the paper I am referring to for the t-SNE implementation. The p_ij formula is at the beginning of section 2.
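
For reference, the affinities defined at the start of section 2 of that paper (van der Maaten & Hinton, 2008) are built from squared Euclidean distances explicitly:

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}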

@tomMoral (Contributor) commented Aug 29, 2017

OK, I just checked, and the current implementation is correct: it only squares the distance when we use the Euclidean distance.
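
In other words, simplified (a sketch of the behavior being described, not the exact t_sne.py source):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(20, 4)
metric = "euclidean"

# Only the Euclidean metric takes the squared fast path; other metrics
# are passed through unsquared.
if metric == "euclidean":
    distances = pairwise_distances(X, metric=metric, squared=True)
else:
    distances = pairwise_distances(X, metric=metric)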

I would probably go for (c): document more clearly the behavior when changing the metric.

@amueller (Member) commented Aug 30, 2017

The question of whether we want different metrics is a bit separate from this regression, though, right?
If we want to remove metric we need a deprecation cycle. Here we just broke behavior.

@tomMoral you haven't checked that all the code makes sense for different metrics, though? I guess the metric parameter was part of the initial implementation, so I would guess that everything takes it into account?

@tomMoral (Contributor) commented Aug 30, 2017

Yes. The behavior is the same as in 0.18 with this PR.

My question is just whether or not the distances should be squared if we want the method to make sense with other metrics. But that is probably a topic for another PR.

@amueller (Member) commented Aug 30, 2017

lgtm as a regression fix

@amueller amueller changed the title enable metric = 'cosine' for tsne computation [MRG + 1] enable metric = 'cosine' for tsne computation Aug 30, 2017
@jnothman (Member) commented Aug 30, 2017

Before merge, could we please have a test that TSNE(metric='manhattan', random_state=0).fit_transform(X) is equivalent to TSNE(metric='precomputed', random_state=0).fit_transform(manhattan_distances(X))?
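
A minimal sketch of such a test (helper and variable names assumed; the final test in the PR may differ):

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import manhattan_distances

def test_tsne_metric_matches_precomputed():
    # The embedding obtained with metric='manhattan' should match the one
    # obtained by precomputing the same distances ourselves.
    X = np.random.RandomState(0).rand(50, 5)
    X_emb = TSNE(metric='manhattan', random_state=0).fit_transform(X)
    X_emb_pre = TSNE(metric='precomputed',
                     random_state=0).fit_transform(manhattan_distances(X))
    assert_array_almost_equal(X_emb, X_emb_pre)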

@rth rth mentioned this pull request Sep 5, 2017
@jnothman (Member) commented Sep 5, 2017

@oliblum90, we would like to release 0.19.1 with this regression fixed soon. Could you please complete this or let us know if we should find another contributor. Thanks.

@jnothman (Member) commented Sep 5, 2017

@tomMoral I also suspect we need to disallow init == 'pca' for precomputed and probably other metrics.

@tomMoral (Contributor) commented Sep 6, 2017

@jnothman init='pca' is disallowed with metric='precomputed'. For other metrics, I don't know. It seems useful to use init='pca' even in these cases, as it gives a first structure to the output space (better than random), which is obtained with the Euclidean space. But it is true that it assumes a Euclidean norm in the output space. This should probably be benchmarked.

I will open an issue for the metric change in t-SNE, to discuss this specific question and the question of squaring the distance.

@jnothman (Member) commented Sep 10, 2017

@oliblum90, we'd like to release 0.19.1 soon. Could you please add a regression test for cosine at least, or pass the buck? Thanks.

@tomMoral (Contributor) commented Sep 14, 2017

  1. I made the commit, but I cannot push it to your branch, @oliblum90. Can you give me write access to your fork, so I can push to this branch?

  2. I would not special-case the Euclidean norm. If KDTree is more efficient for low-dimensional data, it is worth using it, as there are not many cases with as many ties as the uniform grid.

@oliblum90 (Author, Contributor) commented Sep 14, 2017

OK, I added you as a collaborator on GitHub. Did it work out?

@jnothman (Member) commented Sep 15, 2017

No such luck @tomMoral

@tomMoral (Contributor) commented Sep 15, 2017

This time, it is the "exact" implementation failing.
There is a discrepancy between Linux and Windows. The exact implementation is pure Python/NumPy; the main change is the RNG, so I guess this test is not stable.
I locally increased the number of seeds tested and got consistent failures for this level of perplexity with 10 seeds.
With perplexity=30 there are no failures for the first 10 seeds, but with 100 seeds it starts failing too, both for exact and bh.

Inherently, some initializations might be bad, and t-SNE is not able to push some points over the potential barrier, so there is always a situation where this test can fail (statistically). A possible solution to get a more robust test is to re-run the "bad initializations": I propose that when this test fails, we re-run t-SNE from Y.

I tried this strategy for 200 random initializations with perplexity=20 and only got 10 re-runs for barnes_hut and 9 for exact. When I checked the failures, the grid was in two parts and t-SNE was not able to unfold it. I think it is a good solution, but what is your take, @jnothman?

Examples of failures (attached images): barnes_hut_10, exact_25, exact_88, exact_145, barnes_hut_12

@jnothman (Member) left a comment

Is this retrying the same as doubling the number of iterations? Or is it very different?

And with a single/shorter run we had numerical instability issues?

Mostly we need the test to be a meaningful measure of correctness, but it's much preferable if it tests the same way on all platforms. Is there a chance we'd get more stability simply by changing the scale of the input grid or something like that?

@tomMoral (Contributor) commented Sep 17, 2017

The proposed solution is not the same as doubling the number of iterations, because there is a second early_exaggeration step.

The main issue is when the optimization converges to a bad local minimum because the initialization is not good enough. With some initializations, there is a potential barrier that forbids the correct disposition of the points, and we end up with solutions like the ones displayed above. Re-running from this local minimum with early_exaggeration makes it possible to break through the potential barrier.

I did not try scaling the grid, but I don't think it will help make things more cross-platform.
Is the random initialization the same on all platforms? If so, the main issue is that the code is not consistent, as some platforms fail the test with the same initialization.

@jnothman (Member) commented Sep 17, 2017

I'm pretty sure numpy.random should be cross-platform.

except AssertionError:
    # If the test fails a first time, re-run with init=Y to see if
    # this was caused by a bad initialization
    tsne.init = Y

@jnothman (Member) commented Sep 18, 2017

Instead of this, what do you think, @tomMoral, of just having a counter of how many times the uniform grid was recovered, and checking that, say, 2 seeds out of 3 were good?

@tomMoral (Contributor) commented Sep 18, 2017

I would say that 3 tries are not enough for this 2-out-of-3 approach (statistically). However, the test is already slow.

I think the proposed solution is more robust. It asserts that we are able to recover a 2D grid from a random initialization using t-SNE. The retry is just there because some initialization configurations can be bad; it does not undermine the fact that we converge to the right solution with a good enough optimization schedule.

@jnothman (Member) commented Sep 18, 2017

Well, I'm not entirely persuaded by that test. I think it may deserve a comment that the distance ties made BH t-SNE platform-dependent due to numerical imprecision (and perhaps that the perturbation to avoid ties led to failures in exact), if I've got that correctly.

Otherwise, I suppose we should merge this and move on to the rest of 0.19.1.

@tomMoral (Contributor) commented Sep 18, 2017

I agree with you, I am not convinced by the test. The non-convexity makes the impact of numerical imprecision hard to evaluate, but I do not see an easy way to fix it.

The re-run has the advantage of ensuring that we can converge to a grid-like solution.

Is the comment okay like that?

@tomMoral (Contributor) left a comment

Could you address this comment? Except for that, this PR seems okay to me; the uniform_grid issue is not related and should be handled separately.

tsne_2 = TSNE(metric='precomputed',
              n_components=n_components_embedding,
              random_state=0).fit_transform(dist_func(X))
t = trustworthiness(tsne_1, tsne_2, n_neighbors=1)

@tomMoral (Contributor) commented Sep 20, 2017

The trustworthiness is designed to test the validity of the results between the input and the output space, using the Euclidean distance.
You should change this test to assert that tsne_1 and tsne_2 are the same.

@oliblum90 (Author, Contributor) commented Sep 21, 2017

Done

@jnothman (Member) commented Sep 25, 2017

I think we should get 0.19.1 out in the next week or so. Another review for merge? @amueller, @lesteve?

Some background: when changing to use NearestNeighbors(algorithm='auto') we found that the uniform grid test failed on some platforms due to KDTree returning neighborhoods in a different order than BallTree. @tomMoral argues that the test is brittle to bad initialisations and so has patched it with a retry mechanism, which does not have test coverage on Travis, but helps the tests to pass overall.

@tomMoral (Contributor) commented Sep 25, 2017

I proposed the retry mechanism based on the observation that when we increase the number of seeds used in the test, we end up with failures without it. On certain platforms, the method might not be consistent (because of the tie handling in KDTree, if I understood correctly), and the retry mechanism should make sure that this does not cause failures.

My point for the retry mechanism: the optimization is not convex, and some configurations might end up in bad local minima. Retrying permits escaping these local minima because we restart from the final point of the previous try and run some early exaggeration steps, which escape the minima.
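
A minimal sketch of that retry logic (helper names are assumed; the actual test code differs):

import numpy as np
from sklearn.manifold import TSNE

def fit_until_grid_recovered(X, assert_uniform_grid):
    # assert_uniform_grid is a hypothetical helper that raises
    # AssertionError when the embedding is not a uniform grid.
    tsne = TSNE(n_components=2, init='random', random_state=0)
    Y = tsne.fit_transform(X)
    try:
        assert_uniform_grid(Y)
    except AssertionError:
        # Restart from the failed embedding: the fresh early_exaggeration
        # phase can push points over the potential barrier.
        tsne.init = Y
        Y = tsne.fit_transform(X)
        assert_uniform_grid(Y)
    return Y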

@lesteve (Member) left a comment

LGTM, even if I have to admit I don't understand the details of TSNE.

Should we have a warning if the metric is not Euclidean, since we are not sure that it is even a good idea to do that, according to #9623 (comment)?

Also a minor nitpick while I was at it.

metrics = ['manhattan', 'cosine']
dist_funcs = [manhattan_distances, cosine_distances]
for metric, dist_func in zip(metrics, dist_funcs):
    tsne_1 = TSNE(metric=metric, n_components=n_components_embedding,

@lesteve (Member) commented Sep 28, 2017

I am not a big fan of lazy variable naming, even less when it is confusing like this (tsne_1 makes you think this is a TSNE estimator while it is a transformed X...). You could do:

X_transformed_tsne = TSNE(...).fit_transform(...)
X_transformed_tsne_precomputed = TSNE(...).fit_transform(...)
@oliblum90 (Author, Contributor) commented Sep 28, 2017

To visualize neural network features, cosine similarity is often the best choice for t-SNE (see here). Neural networks are becoming more and more important, so I think it is pretty important to include this distance metric.

@tomMoral (Contributor) commented Oct 2, 2017

I think it is the right solution. This fixes the regression, and we can discuss the proper support of other metrics in #9695.

@lesteve (Member) commented Oct 2, 2017

Thanks @tomMoral, I pushed the fix for the nitpick I had and will merge when this is green.

@lesteve (Member) commented Oct 2, 2017

Merging, thanks a lot @oliblum90 and @tomMoral!

@lesteve lesteve merged commit 1701fcf into scikit-learn:master Oct 2, 2017
5 of 6 checks passed:

  • codecov/patch: 89.28% of diff hit (target 96.16%)
  • ci/circleci: Your tests passed on CircleCI!
  • codecov/project: 96.16% (+<.01%) compared to d8c363f
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • lgtm analysis: Python: No alert changes
jnothman added a commit to jnothman/scikit-learn that referenced this pull request Oct 3, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017