
[MRG+1] t-SNE #2822

Merged
merged 42 commits into from Jun 4, 2014

Conversation

AlexanderFabisch
Member

This is an implementation of (non-parametric) t-SNE for visualization.

See Laurens van der Maaten's paper or his website about t-SNE for details. Compared to other implementations and the original paper, this version has the following features:

  • it is designed and optimized for Python
  • the degrees of freedom of the Student's t-distribution are determined with a heuristic
  • it has only a few parameters to control the optimization: learning_rate, n_iter, and early_exaggeration; the momentum etc. are fixed at values that work well for most datasets
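
A minimal usage sketch of the estimator (using the API as it eventually shipped in scikit-learn; the subsample and random_state are only for a quick, reproducible run):

```python
# Hedged sketch: embed a small subset of the digits dataset with t-SNE.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_small = X[:200]  # subsample so the example runs quickly

# fit_transform returns the low-dimensional embedding directly
emb = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X_small)
print(emb.shape)  # (200, 2)
```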

TODO

  • Implement real transform (and maybe even inverse_transform)
  • integrate in sklearn (relative imports, build with cython, etc.)
  • remove Python function calls from binary search
  • reference for the trustworthiness score
  • more parameters for gradient descent (n_iter, learning_rate, early_exaggeration)
  • find a robust learning schedule
  • refactor t-SNE so that it is possible for the user to implement parametric t-SNE
  • distances should be called affinities
  • tests
  • example (documentation, comparisons)
  • t-SNE should expose the attributes embedding_, nbrs_, training_data_, embedding_nbrs_ (similar to Isomap)
  • Remove generalization
  • Narrative documentation in doc/modules/manifold.rst
  • Integrate PCA initialization (merge + docstring)
  • Mention papers about Barnes-Hut-SNE etc. in comment in t_sne.py

Learning Schedules

In the literature:

  • original paper: initialization with standard deviation 1e-4, 1000 episodes, learning rate 100, momentum 0.5 for 250 episodes, 0.8 for the rest, early exaggeration with 4 for 50 episodes
  • matlab implementation: learning rate 500, early exaggeration for 100 episodes
  • python implementation: initialization with standard deviation 1, learning rate 500, early exaggeration for 100 episodes, momentum 0.5 for 20 episodes
  • divvy: initialization with standard deviation 1e-4, 1000 episodes, learning rate 1000, momentum 0.5 for 100 episodes, 0.8 for the rest, early exaggeration with 4 for 100 episodes
  • parametric t-sne (not comparable): conjugate gradient
  • barnes-hut t-sne: initialization with standard deviation 1e-4, 1000 episodes, learning rate 200, momentum 0.5 for 250 episodes, 0.8 for the rest, early exaggeration with 12 for 250 episodes

My experiences:

  • the learning rate has to be set manually for optimal performance, something between 100 and 1000
  • a high momentum (0.8) during early exaggeration improves the result

This implementation uses the following schedule:

  • initialization with standard deviation 1e-4, 1000 episodes, learning rate 1000, momentum 0.5 for 50 episodes, 0.8 for the rest, early exaggeration with 4 for 100 episodes
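
This schedule can be sketched as plain gradient descent with momentum; the gradient function below is a stand-in (a toy quadratic objective), not this PR's t-SNE gradient, and serves only to show the momentum and exaggeration switch points:

```python
import numpy as np

def gradient_descent(grad, Y0, n_iter=1000, learning_rate=1000.0,
                     momentum_switch=50, exaggeration_stop=100):
    """Momentum gradient descent with the schedule described above:
    momentum 0.5 for the first `momentum_switch` iterations, 0.8 afterwards;
    early exaggeration factor 4 for the first `exaggeration_stop` iterations."""
    Y = Y0.copy()
    update = np.zeros_like(Y)
    for i in range(n_iter):
        momentum = 0.5 if i < momentum_switch else 0.8
        exaggeration = 4.0 if i < exaggeration_stop else 1.0
        update = momentum * update - learning_rate * grad(Y, exaggeration)
        Y += update
    return Y

rng = np.random.RandomState(0)
Y0 = 1e-4 * rng.randn(5, 2)            # init with standard deviation 1e-4
quadratic_grad = lambda Y, ex: 2 * Y   # toy objective ||Y||^2 (ignores ex)
Y = gradient_descent(quadratic_grad, Y0, learning_rate=0.1)
print(np.allclose(Y, 0.0))  # True: the toy objective is driven to its minimum
```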

Observations

  • early compression (an L2 penalty at the beginning of the optimization) did not give a significant advantage in my experiments
  • L-BFGS is faster on smaller datasets, and on larger datasets it creates larger gaps between natural clusters than gradient descent does
  • usually visualizations look better with gradient descent even though L-BFGS finds better local minima
  • binary search requires 2.3 seconds in Cython and 3.9 seconds in Python on the digits dataset
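
The binary search timed here matches each sample's Gaussian bandwidth to the desired perplexity; a simplified NumPy sketch of that standard recipe for a single sample (not the PR's Cython code):

```python
import numpy as np

def binary_search_beta(dist_row, target_perplexity=30.0, n_steps=100, tol=1e-5):
    """Find the precision beta such that p_i ~ exp(-beta * d_i) has the
    desired perplexity (= exp of the Shannon entropy), given one sample's
    squared distances to all other samples."""
    desired_entropy = np.log(target_perplexity)
    beta, beta_min, beta_max = 1.0, -np.inf, np.inf
    for _ in range(n_steps):
        p = np.exp(-dist_row * beta)
        sum_p = p.sum()
        entropy = np.log(sum_p) + beta * (dist_row * p).sum() / sum_p
        if abs(entropy - desired_entropy) <= tol:
            break
        if entropy > desired_entropy:   # too flat -> sharpen (raise beta)
            beta_min = beta
            beta = beta * 2.0 if beta_max == np.inf else (beta + beta_max) / 2.0
        else:                           # too peaked -> flatten (lower beta)
            beta_max = beta
            beta = beta / 2.0 if beta_min == -np.inf else (beta + beta_min) / 2.0
    p = np.exp(-dist_row * beta)
    return beta, p / p.sum()

rng = np.random.RandomState(0)
d = rng.rand(99)  # toy squared distances to 99 other samples
beta, p = binary_search_beta(d, target_perplexity=30.0)
perplexity = np.exp(-(p * np.log(p)).sum())
print(round(perplexity, 1))  # ~30.0
```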

Tips

  • reducing the dimensionality of data to its first 50 principal components often results in better t-SNE visualizations
  • if the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high
  • if the cost function gets stuck in a bad local minimum, increasing the learning rate sometimes helps
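
The first tip can be sketched with the released scikit-learn API: reduce to the first 50 principal components, then embed with t-SNE (the subsample is only so the example runs quickly):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)   # 64 features, so 50 components apply
X_pca = PCA(n_components=50).fit_transform(X[:200])
emb = TSNE(n_components=2, random_state=0).fit_transform(X_pca)
print(emb.shape)  # (200, 2)
```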

Examples

Visualizations of some datasets can be found here, e.g.

Digits dataset

Work for other pull requests

"""
return trustworthiness(self, X, n_neighbors=n_neighbors)

def transform(self, X):
Member

Just implement fit_transform and remove transform.

Member Author

That would make it impossible to use grid search.

Member

Why not? Does grid search need transform?

transform must be able to generalize to new data. Transductive transformers should implement fit and fit_transform only.

Your score method seems to be able to generalize to new data so this line should be fine:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L1199

Member

No, score doesn't generalise. There should really be a fit_score method to parallel fit_predict and fit_transform.

@mblondel
Member

mblondel commented Feb 4, 2014

@dwf might be interested in reviewing this :)

return p, error


def trustworthiness(estimator, X, n_neighbors=5):
Member

I think this makes more sense as a function of X and X_embedded.
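
For reference, the suggested signature is what eventually shipped as sklearn.manifold.trustworthiness; a quick sketch (PCA stands in for the embedding here):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X, _ = load_digits(return_X_y=True)
X = X[:200]
X_embedded = PCA(n_components=2).fit_transform(X)
t = trustworthiness(X, X_embedded, n_neighbors=5)
print(0.0 <= t <= 1.0)  # the score lies in [0, 1]; 1 means neighbors are fully preserved
```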

Member

Nitpick: I'd prefer the file to be named t_sne.py

And thanks for this contribution! 

Member Author

Thanks for the tips

@jnothman
Member

jnothman commented Feb 4, 2014

I'm not sure that this should be a Transformer, rather than a function. Had you intended to perform a grid search over its parameters?

@GaelVaroquaux
Member

I think that it would be useful to have both the function and the transformer. The transformer is a standard in scikit-learn, but the function is also useful. 


@AlexanderFabisch
Member Author

I agree with Gael. Grid search might be very useful since t-SNE has at least the hyperparameter perplexity and maybe I will also add some parameters that control the optimizer (learning_rate, momentum, ...).

@jnothman jnothman closed this Feb 4, 2014
@jnothman jnothman reopened this Feb 4, 2014
@GaelVaroquaux
Member

The transformer is a standard in scikit-learn
But this isn't transforming into something

OK, I agree that it shouldn't be a transformer. Just an estimator.

@jnothman
Member

jnothman commented Feb 4, 2014

But this isn't transforming into something

Ha :P I didn't mean for the message to send like that. It's not
transforming in the sense of a pipeline, etc. But then I recalled that we
have transformers for targets as well as features...

Using GridSearchCV doesn't make sense either, because we can't do CV.

So to parallel the current API, this should really have fit() and
fit_score() (and perhaps other unsupervised, non-inductive estimators
should too).


@mblondel
Member

mblondel commented Feb 4, 2014

@dwf Do you confirm that tSNE cannot generalize to new data? (including heuristics)

@mblondel
Member

mblondel commented Feb 4, 2014

Using GridSearchCV doesn't make sense either, because we can't do CV.

I think @AlexanderFabisch wants to use CV to obtain a trustworthiness score on unseen data (the unseen data doesn't need to be transformed to obtain that score). What I don't understand is why score can generalize to new data but not transform. Does tSNE optimize for a different criterion from the one implemented in score?

@mblondel
Member

mblondel commented Feb 4, 2014

For unsupervised probabilistic models, I think it is sometimes possible to do model selection by maximizing the likelihood of the training data but we don't have mechanisms for that in scikit-learn.

@AlexanderFabisch
Member Author

I think there is currently no other method available than parametric t-SNE, which builds on a stack of RBMs and is very difficult to tune because there are so many hyperparameters. I could implement something like an average of nearest neighbors. That could also be used for inverse_transform.


@AlexanderFabisch
Member Author

t-SNE cannot generalize at all in this implementation. We can only optimize the training score. t-SNE creates two distributions, P and Q, based on the distances between samples in the original and the embedded space, and minimizes the Kullback-Leibler divergence between them. The trustworthiness tells us how well neighbors are preserved in the embedded space.
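
A toy sketch of that objective (fixed-bandwidth Gaussian affinities stand in for P, and Q uses the Student's t with one degree of freedom; illustrative only, not this PR's implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist

def kl_divergence(X, Y):
    """KL(P || Q) between pairwise affinities in the original space (P)
    and the embedded space (Q, Student's t with one degree of freedom)."""
    p = np.exp(-pdist(X, "sqeuclidean"))       # Gaussian-style, fixed bandwidth
    P = p / p.sum()
    q = 1.0 / (1.0 + pdist(Y, "sqeuclidean"))  # heavy-tailed in the embedding
    Q = q / q.sum()
    return np.sum(P * np.log(P / Q))

rng = np.random.RandomState(0)
X = rng.randn(30, 10)   # original data
Y = rng.randn(30, 2)    # a (random, unoptimized) embedding
kl = kl_divergence(X, Y)
print(kl >= 0.0)  # True: KL divergence is non-negative
```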


@GaelVaroquaux
Member

I could implement something like an average of nearest neighbors. That
could also be used for inverse_transform.

It might be worth trying.

@amueller
Member

amueller commented Feb 4, 2014

I think we used the term "transformer" pretty loosely in the context of the manifold module. Spectral embedding doesn't have a transform but also inherits from TransformerMixin. I had to check to actually see that LLE and Isomap have transform methods.

I don't think we should implement non-standard hacks in transform and inverse_transform.
I would definitely make this an Estimator and possibly inherit from TransformerMixin. Currently each Estimator inherits from one of the four base mixins ClusterMixin, ClassifierMixin, TransformerMixin or RegressorMixin --- except the algorithms working on labels and the meta-estimators.

I don't think we should tie the decision whether this is a class or not to whether we want to be able to GridSearch it. I never use any sklearn function, because the estimators have such a nice interface.

@amueller
Member

amueller commented Feb 4, 2014

Also: wohoo, t-SNE!

from scipy.optimize import fmin_l_bfgs_b
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
import binary_search
Member

use relative import

@AlexanderFabisch
Member Author

It seems like there is a huge interest in having this in sklearn. ;) I think this will take a while but I think it is worth it.

@mblondel
Member

mblondel commented Feb 5, 2014

I could implement something like an average of nearest neighbors

I think you could reuse the same approach for SpectralEmbedding (which only implements fit and fit_transform too, by the way). This can be done in another PR.

@mblondel
Member

mblondel commented Feb 5, 2014

T-SNE cannot generalize at all in this implementation

Then, how are you planning to use grid search?!

@jnothman
Member

jnothman commented Feb 5, 2014

Then, how are you planning to use grid search?!

Could use cv=[(arange(X.shape[0]), arange(X.shape[0]))]?


if self.distances == "precomputed" and X.shape[0] != X.shape[1]:
    raise ValueError("X should be a square distance matrix")

self.Y_ = self._tsne(X)
Member

Would this attribute be better called embedding_ as it is in SpectralEmbedding?

Member Author

Actually, it was called embedding_ previously. :) I will change that back.

@mblondel
Member

mblondel commented Feb 5, 2014

Could use cv=[(arange(X.shape[0]), arange(X.shape[0]))]?

Indeed that would work in the current state of this PR but score is supposed to generalize to new data. In the future, we will need a way to do model selection for such "transductive" algorithms. That would be a useful addition to scikit-learn.
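
The identity-split trick mentioned earlier can be sketched like this; KMeans stands in for a transductive estimator with a score method, and the modern model_selection import path is assumed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV

X, _ = load_digits(return_X_y=True)

# One "split" whose train and test sets are both the full dataset,
# so the grid search scores each candidate on its own training data.
identity_cv = [(np.arange(X.shape[0]), np.arange(X.shape[0]))]

search = GridSearchCV(KMeans(n_init=10, random_state=0),
                      {"n_clusters": [8, 10]}, cv=identity_cv)
search.fit(X)
print(search.best_params_)
```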

@AlexanderFabisch
Member Author

@ogrisel The branch has been rebased on master.

@ogrisel
Member

ogrisel commented Jun 4, 2014

This looks good to me. Shall we merge?

@GaelVaroquaux
Member

This looks good to me. Shall we merge?

👍!

@agramfort
Member

+1 on my side too

@mblondel
Member

mblondel commented Jun 4, 2014

+1

Thanks for putting up with our nitpicking @AlexanderFabisch :b

@AlexanderFabisch
Member Author

I have to thank you and all the other reviewers, in particular @ogrisel and @GaelVaroquaux . Your comments really improved the quality of the code! ;)

agramfort added a commit that referenced this pull request Jun 4, 2014
@agramfort agramfort merged commit 0a4ba72 into scikit-learn:master Jun 4, 2014
@GaelVaroquaux
Member

Merged #2822.

Hurray! 🍻

@amueller
Member

Was there no whatsnew for this or am I blind? Wasn't this one of the highlights of 0.15?

@AlexanderFabisch
Member Author

No, I can't find it either.

@amueller
Member

Do you want to add it? I feel whatsnew is a good way to check when something was added.

@AlexanderFabisch
Member Author

OK, should I open a pull request for that or is it possible to commit that directly to master?

@GaelVaroquaux
Member

OK, should I open a pull request for that or is it possible to commit that directly to master?

I think that you can commit this directly to master. Thanks!

@AlexanderFabisch
Member Author

Was there no whatsnew for this or am I blind? Wasn't this one of the highlights of 0.15?

I added an entry to the list of highlights and one to the list of new features with bdcea50

@AlexanderFabisch
Member Author

@GaelVaroquaux travis complains about a failing unit test. I can't see how this is related to my commit. Do you have any idea?

ERROR: sklearn.tests.test_common.test_regressors('OrthogonalMatchingPursuitCV', <class 'sklearn.linear_model.omp.OrthogonalMatchingPursuitCV'>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python2.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/estimator_checks.py", line 880, in check_regressor_data_not_an_array
    check_estimators_data_not_an_array(name, Estimator, X, y)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/estimator_checks.py", line 901, in check_estimators_data_not_an_array
    estimator_1.fit(X_, y_)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/omp.py", line 817, in fit
    for train, test in cv)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/joblib/parallel.py", line 659, in __call__
    self.dispatch(function, args, kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/joblib/parallel.py", line 406, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/joblib/parallel.py", line 140, in __init__
    self.results = func(*args, **kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/omp.py", line 711, in _omp_path_residues
    return_path=True)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/omp.py", line 376, in orthogonal_mp
    copy_X=copy_X, return_path=return_path
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/omp.py", line 110, in _cholesky_omp
    **solve_triangular_args)
  File "/home/travis/miniconda/envs/testenv/lib/python2.6/site-packages/scipy/linalg/basic.py", line 137, in solve_triangular
    a1, b1 = map(np.asarray_chkfinite,(a,b))
  File "/home/travis/miniconda/envs/testenv/lib/python2.6/site-packages/numpy/lib/function_base.py", line 590, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

@GaelVaroquaux
Member

@GaelVaroquaux travis complains about a failing unit test. I can't see how this is related to my commit. Do you have any idea?

Heisenbug, maybe? I restarted the travis job. We'll see what it gives.

@GaelVaroquaux
Member

Heisenbug, maybe? I restarted the travis job. We'll see what it gives.

Yup. Heisenbug...

@AlexanderFabisch
Member Author

Thanks for checking.

@GaelVaroquaux
Member

Well, thanks for mentioning that something broke. That's important!
