Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug fix for t-SNE (issue #3526) #3532

Closed
wants to merge 2 commits into from
Closed

bug fix for t-SNE (issue #3526) #3532

wants to merge 2 commits into from

Conversation

makokal
Copy link
Contributor

@makokal makokal commented Aug 5, 2014

Some of the pairwise distances do not support the additional squared parameter. I suggest using sqeuclidian and such whenever this is required.

Some of the pairwise distances do not support the additional `squared` parameter. I suggest using `sqeuclidian` and such whenever this is required.
@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling b631402 on makokal:master into 0a7bef6 on scikit-learn:master.

@jnothman
Copy link
Member

jnothman commented Aug 5, 2014

As someone who doesn't know this code well:

I assume with this change the default metric should be changed to "sqeuclidian". Perhaps PAIRWISE_DISTANCE_FUNCTIONS should also include 'sqeuclidean': partial(euclidean_distances, squared=True) to benefit from the internal implementation (assuming there are benefits to the internal implementation over scipy.spatial).

Should the calls to pairwise_distances in trustworthiness be using the same metric?

@makokal
Copy link
Contributor Author

makokal commented Aug 5, 2014

I agree, default metric should then be `sqeuclidean'

@jnothman
Copy link
Member

jnothman commented Aug 7, 2014

It's a little disturbing that tests succeed without that change. Could you change the default to sqeuclidean. Could you please also add a non-regression test that uses a pairwise distance without squared support.

@larsmans
Copy link
Member

larsmans commented Aug 7, 2014

When was the t-SNE code merged in anyway? Before or after 0.15?

@larsmans
Copy link
Member

larsmans commented Aug 7, 2014

Oh, it's in 0.15...

@jnothman The scipy.spatial code is way slower than ours in high-d cases.

@jnothman
Copy link
Member

jnothman commented Aug 7, 2014

Well, shouldn't it be simple to implement sqeuclidean as a name for our
implementation?

On 8 August 2014 02:32, Lars Buitinck notifications@github.com wrote:

Oh, it's in 0.15...

@jnothman https://github.com/jnothman The scipy.spatial code is way
slower than ours in high-d cases.


Reply to this email directly or view it on GitHub
#3532 (comment)
.

@mblondel mblondel changed the title bug fix for issue #3526 bug fix for t-SNE (issue #3526) Aug 12, 2014
@mblondel
Copy link
Member

The name has already been discussed. We decided to use the name euclidean for consistency with AffinityPropagation and other classes. I would just do

if self.metric == "euclidean":
    distances = pairwise_distances(X, metric=self.metric, squared=True)
else:
    distances = pairwise_distances(X, metric=self.metric)

Or you can implement a filter_params option like the one in pairwise_kernels but this is a bit more work.

Also you need to add a non-regression test (a test which shows that other metrics than the default one work as expected).

@jnothman
Copy link
Member

The name has already been discussed. We decided to use the name euclidean for consistency with AffinityPropagation and other classes.

Do you mean it's been decided that 'sqeuclidean' shouldn't be mapped to the scikit-learn implementation?

@mblondel
Copy link
Member

In AffinityPropagation, the metric / affinity is called "euclidean" even though the distances are squared:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/affinity_propagation_.py#L282

We chose to do the same for consistency.

In general, I don't think it would be useful to add sqeuclidean to the list of supported metrics of pairwise_distances because the argmin of sqeuclidean and euclidean are the same. For t-SNE, I'm not sure whether sqeuclidean and euclidean might result in different embeddings. @AlexanderFabisch could you comment why you chose to use squared=True? Perhaps this is what the original paper uses? Or is just a computational saving?

Either way we need to choose one of the following three options:

    1. We support both sqeuclidean and euclidean
    1. We support only euclidean and we use squared=True like AffinityPropagation
    1. We support only euclidean and we use squared=False as this PR does (squared=False by default in euclidean_distances)

@AlexanderFabisch
Copy link
Member

  • The original paper uses squared euclidean distances.
  • sqeuclidean and euclidean will result in different embeddings, because not only the ranks but also the magnitude of distances are important for t-SNE
  • I would vote for option 2. You could pass a precomputed distance matrix if you want to use another metric.

@jnothman
Copy link
Member

Thanks @AlexanderFabisch https://github.com/AlexanderFabisch. Could you
please also confirm that the other places in the code where euclidean is
hardcoded, it should be?

On 13 August 2014 19:13, Alexander Fabisch notifications@github.com wrote:

  • The original paper
    http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
    uses squared euclidean distances.
  • sqeuclidean and euclidean will result in different embeddings,
    because not only the ranks but also the magnitude of distances are
    important for t-SNE
  • I would vote for option 2. You could pass a precomputed distance
    matrix if you want to use another metric.


Reply to this email directly or view it on GitHub
#3532 (comment)
.

@AlexanderFabisch
Copy link
Member

  • _fit - correct
  • _kl_divergence - these must be squared euclidean distances because we compute probabilities of the Student's t-distribution here
  • trustworthiness - only the ranking is relevant, you could remove squared=True, which should not make a difference

Did I miss something?

@AlexanderFabisch
Copy link
Member

Regarding the speed of pdist vs. pairwise_distances: we did some benchmarks. The result is: we use pairwise_distances for the original data (X) because X might be high-dimensional or even sparse and we use pdist in the embedded space because all points in the embedded space are dense and have only 2-3 dimensions.

@jnothman
Copy link
Member

Thanks, that makes sense.

On 13 August 2014 21:22, Alexander Fabisch notifications@github.com wrote:

Regarding the speed of pdist vs. pairwise_distances: we did some
benchmarks #2822. The
result is: we use pairwise_distances for the original data (X) because X
might be high-dimensional or even sparse and we use pdist in the embedded
space because all points in the embedded space are dense and have only 2-3
dimensions.


Reply to this email directly or view it on GitHub
#3532 (comment)
.

@jnothman
Copy link
Member

@makokal are you able to follow through with @mblondel's recommendation?

@makokal
Copy link
Contributor Author

makokal commented Aug 14, 2014

@jnothman Yes, I will redo the pull request shortly, thanks @mblondel

New fix following the discussion on the previous pull request. Thanks to @jnothman,  @mblondel and others ..
@jnothman
Copy link
Member

A non-regression test would be appropriate too.

@ogrisel
Copy link
Member

ogrisel commented Sep 1, 2014

+1 for a new test as well.

@@ -431,8 +431,12 @@ def _fit(self, X):
distances = X
else:
if self.verbose:
print("[t-SNE] Computing pairwise distances...")
distances = pairwise_distances(X, metric=self.metric, squared=True)
print("[t-SNE] Computing pairwise distances...")\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please remove the trailing \.

@AlexanderFabisch
Copy link
Member

Are you going to finish this @makokal or can I help you?

@makokal
Copy link
Contributor Author

makokal commented Oct 13, 2014

@AlexanderFabisch Sorry been tied up much lately, you can go ahead a finish it up. Much appreciated

@AlexanderFabisch
Copy link
Member

I added a test and removed the trailing '' in https://github.com/AlexanderFabisch/scikit-learn/tree/makokal-master . Before I open another pull request: is everyone happy with this solution or should we find a different solution or should we document this in the docstring?

@makokal
Copy link
Contributor Author

makokal commented Oct 13, 2014

Looks fine with me

@larsmans
Copy link
Member

Superseded by #3786. Thanks!

@larsmans larsmans closed this Oct 21, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants