bug fix for t-SNE (issue #3526) #3532

makokal · 2014-08-05T11:57:15Z

Some of the pairwise distances do not support the additional `squared` parameter. I suggest using `sqeuclidean` and such whenever this is required.

coveralls · 2014-08-05T12:08:37Z

Coverage remained the same when pulling b631402 on makokal:master into 0a7bef6 on scikit-learn:master.

jnothman · 2014-08-05T12:15:21Z

As someone who doesn't know this code well:

I assume that with this change the default metric should be changed to "sqeuclidean". Perhaps `PAIRWISE_DISTANCE_FUNCTIONS` should also include `'sqeuclidean': partial(euclidean_distances, squared=True)` to benefit from the internal implementation (assuming it has advantages over `scipy.spatial`).

Should the calls to pairwise_distances in trustworthiness be using the same metric?

makokal · 2014-08-05T17:55:16Z

I agree, the default metric should then be `sqeuclidean`.

jnothman · 2014-08-07T11:23:53Z

It's a little disturbing that tests succeed without that change. Could you change the default to sqeuclidean? Could you please also add a non-regression test that uses a pairwise distance without `squared` support?

larsmans · 2014-08-07T16:18:54Z

When was the t-SNE code merged in anyway? Before or after 0.15?

larsmans · 2014-08-07T16:32:24Z

Oh, it's in 0.15...

@jnothman The scipy.spatial code is way slower than ours in high-d cases.

jnothman · 2014-08-07T21:47:13Z

Well, shouldn't it be simple to implement sqeuclidean as a name for our implementation?

mblondel · 2014-08-12T14:40:34Z

The name has already been discussed. We decided to use the name euclidean for consistency with AffinityPropagation and other classes. I would just do

if self.metric == "euclidean":
    distances = pairwise_distances(X, metric=self.metric, squared=True)
else:
    distances = pairwise_distances(X, metric=self.metric)

Or you can implement a filter_params option like the one in pairwise_kernels but this is a bit more work.

Also you need to add a non-regression test (a test which shows that other metrics than the default one work as expected).
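
The `filter_params` idea can be sketched roughly as follows. This is only an illustration, not scikit-learn's implementation: the `METRIC_FUNCTIONS` and `METRIC_PARAMS` tables and the `filtered_pairwise_distances` helper are hypothetical names, and the two metric functions are small NumPy stand-ins. Keyword arguments that a metric does not declare are dropped instead of being forwarded, so passing `squared=True` together with a metric that lacks that parameter no longer raises.

```python
import numpy as np


def euclidean(X, squared=False):
    """Stand-in Euclidean distance matrix with an optional `squared` flag."""
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    return d2 if squared else np.sqrt(d2)


def manhattan(X):
    """Stand-in Manhattan distance matrix; it has no `squared` parameter."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(np.abs(diff), axis=-1)


# Hypothetical tables: metric name -> function, metric name -> accepted kwargs.
METRIC_FUNCTIONS = {"euclidean": euclidean, "manhattan": manhattan}
METRIC_PARAMS = {"euclidean": {"squared"}, "manhattan": set()}


def filtered_pairwise_distances(X, metric="euclidean", filter_params=False, **kwds):
    """Dispatch on `metric`, optionally dropping kwargs the metric does not accept."""
    if filter_params:
        allowed = METRIC_PARAMS[metric]
        kwds = {k: v for k, v in kwds.items() if k in allowed}
    return METRIC_FUNCTIONS[metric](X, **kwds)


X = np.array([[0.0, 0.0], [3.0, 4.0]])
# `squared=True` is honoured for "euclidean" and silently ignored for "manhattan".
d_euclidean = filtered_pairwise_distances(X, "euclidean", filter_params=True, squared=True)
d_manhattan = filtered_pairwise_distances(X, "manhattan", filter_params=True, squared=True)
```

Without `filter_params=True`, the second call would fail with a `TypeError`, which is the behaviour reported in the issue.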

jnothman · 2014-08-13T06:17:08Z

> The name has already been discussed. We decided to use the name euclidean for consistency with AffinityPropagation and other classes.

Do you mean it's been decided that 'sqeuclidean' shouldn't be mapped to the scikit-learn implementation?

mblondel · 2014-08-13T07:47:02Z

In AffinityPropagation, the metric / affinity is called "euclidean" even though the distances are squared:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/affinity_propagation_.py#L282

We chose to do the same for consistency.

In general, I don't think it would be useful to add sqeuclidean to the list of supported metrics of pairwise_distances, because the argmin of sqeuclidean and euclidean is the same. For t-SNE, I'm not sure whether sqeuclidean and euclidean might result in different embeddings. @AlexanderFabisch could you comment on why you chose to use squared=True? Perhaps this is what the original paper uses? Or is it just a computational saving?

Either way we need to choose one of the following three options:

1. We support both sqeuclidean and euclidean
2. We support only euclidean and we use squared=True like AffinityPropagation
3. We support only euclidean and we use squared=False as this PR does (squared=False by default in euclidean_distances)

AlexanderFabisch · 2014-08-13T09:13:13Z

The [original paper](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) uses squared Euclidean distances.

sqeuclidean and euclidean will result in different embeddings, because not only the ranks but also the magnitudes of the distances are important for t-SNE.

I would vote for option 2. You could pass a precomputed distance matrix if you want to use another metric.
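
A small numeric sketch of the point about magnitudes: squaring keeps the neighbour ranking but changes the Gaussian conditional probabilities computed from the distances. The toy distances and the fixed bandwidth `sigma` are assumptions made for illustration; t-SNE itself tunes the bandwidth per point to match the requested perplexity.

```python
import numpy as np

# Distances from one reference point to three neighbours (toy values).
d = np.array([1.0, 3.0, 7.0])
d_squared = d ** 2


def conditional_probabilities(dist, sigma=2.0):
    """Gaussian affinities proportional to exp(-dist / (2 * sigma**2))."""
    weights = np.exp(-dist / (2.0 * sigma ** 2))
    return weights / weights.sum()


# Squaring is monotonic on non-negative values, so the ranking is unchanged...
same_ranking = np.array_equal(np.argsort(d), np.argsort(d_squared))

# ...but the resulting probability distributions are not the same.
p_from_squared = conditional_probabilities(d_squared)
p_from_plain = conditional_probabilities(d)
```

Here `same_ranking` is true while `p_from_squared` and `p_from_plain` differ, so the optimisation target, and with it the embedding, changes.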

jnothman · 2014-08-13T10:05:01Z

Thanks @AlexanderFabisch. Could you please also confirm that, in the other places in the code where euclidean is hardcoded, it should be?

AlexanderFabisch · 2014-08-13T11:14:54Z

- `_fit`: correct
- `_kl_divergence`: these must be squared Euclidean distances because we compute probabilities of the Student's t-distribution here
- `trustworthiness`: only the ranking is relevant; you could remove `squared=True`, which should not make a difference

Did I miss something?
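
For reference, a minimal sketch of why the embedded-space distances have to be squared: with one degree of freedom, as in the original paper, the Student's t affinities are proportional to 1 / (1 + ||y_i - y_j||^2). The helper name `student_t_affinities` is hypothetical, and this is not the code in `_kl_divergence`.

```python
import numpy as np


def student_t_affinities(Y):
    """Joint probabilities q_ij proportional to 1 / (1 + ||y_i - y_j||^2)."""
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)  # squared Euclidean distances
    weights = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(weights, 0.0)        # q_ii is defined as zero
    return weights / weights.sum()


# Three points in a 2-D embedding (toy values).
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Q = student_t_affinities(Y)
```

Feeding non-squared distances into the same formula would yield a different heavy-tailed kernel from the one the paper defines.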

AlexanderFabisch · 2014-08-13T11:22:03Z

Regarding the speed of pdist vs. pairwise_distances: we did some benchmarks (#2822). The result is: we use pairwise_distances for the original data (X) because X might be high-dimensional or even sparse, and we use pdist in the embedded space because all points in the embedded space are dense and have only 2-3 dimensions.

jnothman · 2014-08-13T12:20:56Z

Thanks, that makes sense.

jnothman · 2014-08-14T03:55:57Z

@makokal are you able to follow through with @mblondel's recommendation?

makokal · 2014-08-14T10:36:04Z

@jnothman Yes, I will redo the pull request shortly, thanks @mblondel

@jnothman

New fix following the discussion on the previous pull request. Thanks to @jnothman, @mblondel and others ..

jnothman · 2014-08-14T11:47:33Z

A non-regression test would be appropriate too.

ogrisel · 2014-09-01T12:00:56Z

+1 for a new test as well.

ogrisel · 2014-09-01T14:37:58Z

sklearn/manifold/t_sne.py

@@ -431,8 +431,12 @@ def _fit(self, X):
            distances = X
        else:
            if self.verbose:
-                print("[t-SNE] Computing pairwise distances...")
-            distances = pairwise_distances(X, metric=self.metric, squared=True)
+                print("[t-SNE] Computing pairwise distances...")\


Also please remove the trailing \.

AlexanderFabisch · 2014-10-13T13:52:54Z

Are you going to finish this @makokal or can I help you?

makokal · 2014-10-13T13:57:46Z

@AlexanderFabisch Sorry, I've been tied up lately; you can go ahead and finish it up. Much appreciated.

AlexanderFabisch · 2014-10-13T14:26:59Z

I added a test and removed the trailing `\` in https://github.com/AlexanderFabisch/scikit-learn/tree/makokal-master . Before I open another pull request: is everyone happy with this solution, should we find a different one, or should we document this in the docstring?

makokal · 2014-10-13T14:36:01Z

Looks fine with me

larsmans · 2014-10-21T16:40:07Z

Superseded by #3786. Thanks!

bug fix for issue #3526

b631402

mblondel changed the title ~~bug fix for issue #3526~~ bug fix for t-SNE (issue #3526) Aug 12, 2014

bug fix for t-SNE (issue #3526) with new inputs

e0cb91d

larsmans force-pushed the master branch from 58a55ad to 4b82379 Compare August 25, 2014 21:50

ogrisel reviewed Sep 1, 2014
AlexanderFabisch mentioned this pull request Oct 20, 2014

[MRG+1] Fix t-SNE with "non-squarable" metric #3786

Merged

larsmans closed this Oct 21, 2014
