[MRG + 1] enable metric = 'cosine' for tsne computation #9623
Conversation
Set `neighbors_method` to `"brute"` for `metric='cosine'`, as for `metric="cosine"` with `neighbors_method="ball_tree"` the `NearestNeighbors` computation raises the error:

```
ValueError: Metric 'cosine' not valid for algorithm 'ball_tree'
```

The following code snippet is working for me in version 0.18.1, but is not working in version 0.19 anymore. With the change I added, the snippet works:

```python
from sklearn.manifold import TSNE
import numpy as np

z = np.random.rand(1000 * 256).reshape((1000, 256))
tsne = TSNE(n_components=2, random_state=0, metric='cosine',
            learning_rate=1000)
tsne.fit_transform(z)
```
Thanks for reporting the regression.
This needs a test.
```diff
@@ -712,7 +712,7 @@ def _fit(self, X, skip_num_points=0):

         # Find the nearest neighbors for every point
         neighbors_method = 'ball_tree'
-        if (self.metric == 'precomputed'):
+        if (self.metric == 'precomputed') or (self.metric == "cosine"):
```
jnothman (Member) commented on Aug 27, 2017:
I don't get why this logic is here at all. I think we should always be using `neighbors_method='auto'`, which will use `ball_tree` when it can. Ping @tomMoral (who should have his name in the tsne source code)?
oliblum90 (Author, Contributor) replied on Aug 27, 2017:
I agree. Shall I make a new pull request?
No, just modify this one, please.
You can just add more commits.
When setting NearestNeighbors `algorithm='auto'`, the optimal algorithm for each case (metric) is chosen automatically. Thus it makes no sense to manually distinguish between cases.
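A quick way to see this behavior (a sketch, not part of the PR; `_fit_method` is a private scikit-learn attribute, used here only for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(100, 16)

# With algorithm='auto', scikit-learn falls back to brute force for
# metrics that BallTree does not support, such as cosine.
nn = NearestNeighbors(n_neighbors=5, algorithm='auto', metric='cosine')
nn.fit(X)
print(nn._fit_method)  # -> 'brute'
```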
Please add a non-regression test.
As a non-regression test I would suggest testing t-SNE with several metrics. I found out that the t-SNE tests mostly generate random points in a high-dimensional space and find an embedding. Afterwards the trustworthiness of the embedding is computed with `sklearn.manifold.t_sne.trustworthiness`. There already are several tests for `metric='precomputed'`:

- test_preserve_trustworthiness_approximately_with_precomputed_distances
- test_non_square_precomputed_distances
- test_non_positive_precomputed_distances

So I would suggest making a test for the metrics ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. However, the method `sklearn.manifold.t_sne.trustworthiness` does not support different metrics. This could actually be fixed easily by exchanging lines 424 and 425 in the file sklearn/manifold/t_sne.py,

```python
dist_X = pairwise_distances(X, squared=True)
dist_X_embedded = pairwise_distances(X_embedded, squared=True)
```

with the lines

```python
dist_X = pairwise_distances(X, squared=True, metric=metric)
dist_X_embedded = pairwise_distances(X_embedded, squared=True, metric=metric)
```

but changing this would actually belong in another branch...
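A rough sketch of the kind of test suggested here (illustrative only; the data shape and trustworthiness threshold are assumptions, and trustworthiness itself still uses euclidean distances, as discussed below):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.manifold.t_sne import trustworthiness

random_state = np.random.RandomState(0)
X = random_state.randn(50, 10)

for metric in ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']:
    # Embed random high-dimensional points with each metric and check
    # that local neighborhoods are reasonably well preserved.
    X_embedded = TSNE(n_components=2, metric=metric,
                      random_state=0).fit_transform(X)
    t = trustworthiness(X, X_embedded, n_neighbors=5)
    assert t > 0.9  # threshold is illustrative
```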
sounds like a good idea
I fixed the method in nearest neighbors to `ball_tree` because the previous implementation was doing so; it was directly using the BallTree implementation. We discussed setting the method to automatic but wanted to do some benchmarking first, I think.

As a side note, you had better be really careful with changing the metric. In some sense, if you change the metric, you do not have the "t-distributed" part of t-SNE, so I am not sure how we cope with this. All the code is valid for the use of euclidean distances in both the input and output space. The output space distance should not be changed, so `dist_X_embedded = pairwise_distances(X_embedded, squared=True)` should not use the parameter `metric`.

For the input space, I am not sure what changing the metric implies. At the least, the `squared=True` option might not be the right parameter, as, for instance, using the squared l1 metric does not seem the right way to go.
Yes, of course. I wasn't thinking about trustworthiness being calculated in the output space. If the method can apply to a distance matrix, then surely at most the algorithm can require any true distance metric to be faithful. (Which may exclude cosine, but at the end of the day this is usually for visualisation, so if it gives you anything it may be good enough.)

`auto` uses ball tree in the euclidean case anyway. Thanks!
Oh yes! I also did not think about the output space. What do you suggest now? Shall I make a new commit as suggested before, just without specifying the metric for the calculation of `dist_X_embedded`?
I think so. Test equality against the corresponding precomputed distances, thanks. @tomMoral is welcome to tell me I have no clue, though.
The metric used in the output space is the euclidean distance. This is hard-coded, at least in the `barnes_hut` implementation, and it is in accordance with the t-SNE paper. It also makes sense, as the euclidean distance is the distance we understand best, so it is well suited for visualization.

For the input space, the metric used in the original paper is also the euclidean distance. I think changing it should be done with care. All the code has been written assuming that the metric used was the euclidean distance; we notably tried to optimize the squared/not-squared computation. So if the `metric` is changed, we should check that the computed probabilities `p_ij` actually make sense.

@oliblum90 if you have time to check the formula for `p_ij`, I can review it. If you don't want to do it, let me know and I will try to do it.
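For reference, the quantities in question as defined in the original t-SNE paper (van der Maaten & Hinton, 2008); note the euclidean distances in both spaces, and the Student-t kernel in the output space that gives the method its name:

```latex
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```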
Okay. Well, we can (a) call a geometry professor; (b) deprecate the metric parameter or limit it; (c) use this kind of fix and document more clearly the caveat that it may not be faithful.

I had thought that for a given pairwise distance matrix and a true metric, a set of points in sufficiently high dimension should exist that produces that distance matrix. But I have no knowledge of the relevant mathematics or how to search the literature, while techniques for constructing such a set of points certainly exist for the euclidean distance.
Ok, I just checked, and the current implementation is correct: it only squares the distance when we use the euclidean distance. I would probably go for (c): document more clearly the behavior when changing the metric.
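A minimal sketch of the behavior described (not the actual t_sne.py source; the variables stand in for the estimator's attributes):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(20, 4)
metric = 'cosine'  # e.g. 'euclidean', 'cosine', 'manhattan'

# Squared euclidean distances feed the p_ij computation; for any other
# metric, the plain (unsquared) distances are used as-is.
if metric == 'euclidean':
    distances = pairwise_distances(X, metric=metric, squared=True)
else:
    distances = pairwise_distances(X, metric=metric)
```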
So you're saying we shouldn't use squared distances if another metric is provided?
The question of whether we want different metrics is a bit separate from this regression, though, right? @tomMoral, you haven't checked that all the code makes sense for different metrics, though? I guess the metric parameter was part of the initial implementation, so I would guess that everything takes it into account?
Yes. The behavior is the same as in 0.18 with this PR. My question is just whether or not the distances should be squared if we want the method to make sense with other metrics. But that is probably a topic for another PR.
LGTM as a regression fix.
Before merge, could we please have a non-regression test?
@oliblum90, we would like to release 0.19.1 with this regression fixed soon. Could you please complete this, or let us know if we should find another contributor. Thanks.
@tomMoral I also suspect we need to disallow init == 'pca' for precomputed and probably other metrics.
@jnothman I will open an issue for the metric change in t-SNE, to discuss this specific question and the question of squaring the distance.
@oliblum90, we'd like to release 0.19.1 soon. Could you please add a regression test for cosine at least, or pass the buck? Thanks.
Ok, I added you as a collaborator on GitHub. Did it work out?
No such luck, @tomMoral.
This time, it is the "exact" implementation failing. Inherently, some initializations might be bad and t-SNE is not able to pass some points over the potential barrier, so there is always a situation where this test can fail (statistically). A possible solution to get a more robust test is to re-run the "bad initialization": I propose that when this test fails, we try to re-run t-SNE starting from the resulting embedding (init=Y). I tried this strategy for 200 random initializations and did not see a failure.

Examples of failures: [embedded screenshots of failed grid embeddings not reproduced here]
Is this retrying the same as doubling the number of iterations? Or is it very different? And with a single/shorter run we had numerical instability issues? Mostly we need the test to be a meaningful measure of correctness, but it's much preferable if it tests the same way on all platforms. Is there a chance we'd get more stability simply by changing the scale of the input grid or something like that?
The proposed solution is not the same as doubling the number of iterations, because the second run also gets a fresh early-exaggeration phase. The main issue is when it converges to a bad local minimum because the initialization is not good enough. With some initializations, there is a potential barrier that forbids the correct disposition of the points, and we end up with solutions like the ones displayed above. Re-running from this local minimum with init=Y permits escaping it.

I did not try scaling the grid, but I don't think this will help to have something more cross-platform.
I'm pretty sure numpy.random should be cross-platform.
```python
        except AssertionError:
            # If the test fails a first time, re-run with init=Y to see if
            # this was caused by a bad initialization
            tsne.init = Y
```
jnothman (Member) commented on Sep 18, 2017:
Instead of this, what do you think, @tomMoral, of just having a counter of how many times the uniform grid was recovered, and checking that, say, 2 seeds of 3 were good.
tomMoral (Contributor) replied on Sep 18, 2017:
I would say that 3 tries are not enough for this 2-out-of-3 approach (statistically speaking). However, the test is already slow.

I think the proposed solution is more robust. It asserts that we are able to recover a 2d grid from a random initialization using t-SNE. The retry is just caused by the fact that you can have some bad initialization configurations, but it does not break the fact that we converge to the right solution with a good enough optimization schedule.
Well, I'm not entirely persuaded by that test. I think it may deserve a comment that the distance ties made BH t-SNE platform-dependent due to numerical imprecision (and perhaps that the perturbation to avoid ties led to failure in exact), if I've got that correctly. Otherwise, I suppose we should merge this and move on to the rest of 0.19.1.
I agree with you; I am not convinced by the test. The non-convexity makes the impact of numerical imprecision hard to evaluate, but I do not see an easy way to fix it. The re-run has the advantage of ensuring that we can converge to a grid-like solution. Is the comment okay like that?
Could you address this comment? Except for it, this PR seems okay to me.
```python
    tsne_2 = TSNE(metric='precomputed',
                  n_components=n_components_embedding,
                  random_state=0).fit_transform(dist_func(X))
    t = trustworthiness(tsne_1, tsne_2, n_neighbors=1)
```
tomMoral (Contributor) commented on Sep 20, 2017:
The trustworthiness is designed to test the validity of the results between the input and the output space, using the euclidean distance. You should change this test to assert that `tsne_1` and `tsne_2` are the same.
oliblum90 (Author, Contributor) replied on Sep 21, 2017:
Done
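Putting the review feedback together, the test body then looks roughly like this (a sketch assembled from the fragments quoted in this thread, not the verbatim merged test):

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_distances, manhattan_distances

random_state = np.random.RandomState(0)
n_components_embedding = 2
X = random_state.randn(50, 3).astype(np.float32)

metrics = ['manhattan', 'cosine']
dist_funcs = [manhattan_distances, cosine_distances]
for metric, dist_func in zip(metrics, dist_funcs):
    tsne_1 = TSNE(metric=metric, n_components=n_components_embedding,
                  random_state=0).fit_transform(X)
    tsne_2 = TSNE(metric='precomputed',
                  n_components=n_components_embedding,
                  random_state=0).fit_transform(dist_func(X))
    # Per the review above: with the same seed, the two embeddings
    # should be identical, so assert equality instead of computing
    # trustworthiness between them.
    assert_array_equal(tsne_1, tsne_2)
```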
I think we should get 0.19.1 out in the next week or so. Another review for merge? @amueller, @lesteve? Some background: when changing the way the nearest neighbors are computed, the uniform-grid test turned out to be platform-dependent, hence the retry mechanism discussed above.
I proposed the retry mechanism based on the observation that when we increase the number of seeds used in the test, we end up with failures without it. On certain platforms, the method might not be consistent (because of the handling of ties in the nearest-neighbor computation).

My point for the retry mechanism: the optimization is not convex and some configurations might end up in bad local minima. Retrying permits escaping these local minima because we restart from the final point of the previous try and get a second early-exaggeration phase.
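A sketch of the retry logic described (the helper `assert_uniform_grid` is a hypothetical stand-in for the test's actual grid check):

```python
import numpy as np
from sklearn.manifold import TSNE

def assert_uniform_grid(Y):
    # Stand-in for the test's actual check that the embedding recovers
    # a roughly uniform 2d grid (e.g. via nearest-neighbor distances).
    ...

# A 10x10 grid of 2d points that t-SNE should be able to recover.
X_2d_grid = np.dstack(np.meshgrid(np.linspace(0, 1, 10),
                                  np.linspace(0, 1, 10))).reshape(-1, 2)

tsne = TSNE(n_components=2, init='random', random_state=0)
Y = tsne.fit_transform(X_2d_grid)
try:
    assert_uniform_grid(Y)
except AssertionError:
    # Bad initialization: restart the optimization from the previous
    # embedding; the second run includes a fresh early-exaggeration
    # phase, which helps escape the bad local minimum.
    tsne.init = Y
    Y = tsne.fit_transform(X_2d_grid)
    assert_uniform_grid(Y)
```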
LGTM, even if I have to admit I don't understand the details of t-SNE. Should we have a warning if the metric is not euclidean, since we are not sure it is even a good idea to do that, according to #9623 (comment)? Also a minor nitpick while I was at it.
```python
metrics = ['manhattan', 'cosine']
dist_funcs = [manhattan_distances, cosine_distances]
for metric, dist_func in zip(metrics, dist_funcs):
    tsne_1 = TSNE(metric=metric, n_components=n_components_embedding,
```
lesteve (Member) commented on Sep 28, 2017:
I am not a big fan of lazy variable naming, even less when the names are confusing like this (`tsne_1` makes you think this is a TSNE estimator while it is a transformed X...). You could do:

```python
X_transformed_tsne = TSNE(...).fit_transform(...)
X_transformed_tsne_precomputed = TSNE(...).fit_transform(...)
```
To visualize neural network features, cosine similarity is often the best choice for t-SNE (see https://medium.com/towards-data-science/reducing-dimensionality-from-dimensionality-reduction-techniques-f658aec24dfe). Neural networks are becoming more and more important, so I think it is pretty important to include this distance metric.
I'm pretty sure, @lesteve, that any distance matrix can be interpreted as euclidean in a sufficiently high-dimensional space. I'm not sure that that is true of other metrics. But with something like t-SNE, working in practice for visualisation is a good enough reason to support it.

The main question remains whether or not we square the distances (which affects the probability distributions). I think we probably should, but perhaps we can do that with a warning in the future, as I'd rather have this merged for 0.19.1. WDYT?
I think this is the right solution. It fixes the regression, and we can discuss proper support of other metrics in #9695.
Thanks @tomMoral. I pushed the fix for the nitpick I had and will merge when this is green.
Merging, thanks a lot @oliblum90 and @tomMoral!
Merged commit 1701fcf into scikit-learn:master