
[MRG+2] Fix trustworthiness custom metric #9775

Merged
@@ -8,6 +8,7 @@
 # * Fast Optimization for t-SNE:
 #   http://cseweb.ucsd.edu/~lvdmaaten/workshops/nips2010/papers/vandermaaten.pdf

+import warnings
 from time import time
 import numpy as np
 from scipy import linalg
@@ -377,7 +378,8 @@ def _gradient_descent(objective, p0, it, n_iter,
     return p, error, i


-def trustworthiness(X, X_embedded, n_neighbors=5, precomputed=False):
+def trustworthiness(X, X_embedded, n_neighbors=5,
+                    precomputed=False, metric='euclidean'):
     """Expresses to what extent the local structure is retained.

     The trustworthiness is within [0, 1]. It is defined as
@@ -413,15 +415,36 @@ def trustworthiness(X, X_embedded, n_neighbors=5, precomputed=False):
     precomputed : bool, optional (default: False)
         Set this flag if X is a precomputed square distance matrix.
+
+        .. deprecated:: 0.20
+           ``precomputed`` has been deprecated in version 0.20 and will be
+           removed in version 0.22. Use ``metric`` instead.
+
+    metric : string, or callable, optional, default 'euclidean'
+        Which metric to use for computing pairwise distances between samples
+        from the original input space. If metric is 'precomputed', X must be a
+        matrix of pairwise distances. Otherwise, see the documentation of

> **jnothman** (Member) commented on Sep 18, 2017:
>
> Can you say "pairwise distances or squared distances"? (Either will work, as this function only uses their rank, not their value.)
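The rank-invariance jnothman mentions is easy to verify: squaring non-negative distances is a monotonic transform, so plain and squared Euclidean distances produce identical neighbor orderings under `argsort`. A minimal numpy sketch (illustrative only, not part of the PR):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10, 3)

# Plain Euclidean distance matrix via broadcasting.
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Squaring preserves the order of non-negative values, so the
# neighbor rankings are identical.
ranks_plain = np.argsort(dist, axis=1)
ranks_squared = np.argsort(dist ** 2, axis=1)
assert np.array_equal(ranks_plain, ranks_squared)
```

This is why trustworthiness can accept either form of precomputed input: only the ranks enter the computation.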

+        argument ``metric`` in ``sklearn.pairwise.pairwise_distances`` for a
+        list of available metrics. However, using a metric other than
+        'euclidean' or 'precomputed' is not standard for this task.
     Returns
     -------
     trustworthiness : float
         Trustworthiness of the low-dimensional embedding.
     """
+    if precomputed:
+        warnings.warn("The flag 'precomputed' has been deprecated in version "
+                      "0.20 and will be removed in 0.22. See 'metric' "
+                      "parameter instead.", DeprecationWarning)
+        metric = 'precomputed'
+    if metric == 'precomputed':

> **jnothman** (Member) commented on Sep 19, 2017:
>
> I think `pairwise_distances` handles 'precomputed', so we don't need a special case here.

> **wdevazelhes** (Author, Contributor) replied on Sep 20, 2017:
>
> Thanks, I will change this.

+        dist_X = X
+    elif metric == 'euclidean':
+        dist_X = pairwise_distances(X, metric='euclidean', squared=True)

> **jnothman** (Member) commented on Sep 19, 2017:
>
> I similarly don't think it's essential to handle Euclidean specially, but at least it saves some computation.

> **wdevazelhes** (Author, Contributor) replied on Sep 20, 2017:
>
> I will change this too.

+    else:
-    dist_X = pairwise_distances(X, squared=True)
+        warnings.warn("The metric '{}' seems not standard for computing "
+                      "trustworthiness.".format(metric), RuntimeWarning)
+        dist_X = pairwise_distances(X, metric=metric)
     dist_X_embedded = pairwise_distances(X_embedded, squared=True)
     ind_X = np.argsort(dist_X, axis=1)
     ind_X_embedded = np.argsort(dist_X_embedded, axis=1)[:, 1:n_neighbors + 1]
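For context on what the diff above is computing, the trustworthiness measure can be sketched end to end. The following is a simplified, illustrative re-implementation (the function name and structure are mine, not scikit-learn's; it omits the 'precomputed' handling and the vectorization tricks of the real code):

```python
import numpy as np
from scipy.spatial.distance import cdist


def trustworthiness_sketch(X, X_embedded, n_neighbors=5, metric='euclidean'):
    """Simplified trustworthiness for illustration only.

    Penalizes points that are k-nearest neighbors in the embedding
    but ranked far away in the original space.
    """
    n = X.shape[0]
    dist_X = cdist(X, X, metric=metric)
    dist_embedded = cdist(X_embedded, X_embedded)

    # Rank of each sample j w.r.t. sample i in the original space
    # (rank 0 is the sample itself).
    ind_X = np.argsort(dist_X, axis=1)
    ranks = np.empty_like(ind_X)
    rows = np.arange(n)[:, np.newaxis]
    ranks[rows, ind_X] = np.arange(n)

    # k nearest neighbors in the embedding, skipping self at position 0.
    ind_embedded = np.argsort(dist_embedded, axis=1)[:, 1:n_neighbors + 1]

    # Each embedded neighbor with original rank r > k contributes (r - k).
    penalty = np.maximum(ranks[rows, ind_embedded] - n_neighbors, 0).sum()
    k = n_neighbors
    return 1.0 - penalty * (2.0 / (n * k * (2.0 * n - 3.0 * k - 1.0)))


rng = np.random.RandomState(0)
X = rng.rand(20, 4)
# A perfect "embedding" (the identity) preserves all neighborhoods.
assert trustworthiness_sketch(X, X, n_neighbors=3) == 1.0
```

Note that `dist_X` enters only through `argsort`, which is why any metric whose ranks are meaningful can be plugged in, as this PR enables.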
@@ -14,6 +14,8 @@
 from sklearn.utils.testing import assert_greater
 from sklearn.utils.testing import assert_raises_regexp
 from sklearn.utils.testing import assert_in
+from sklearn.utils.testing import assert_warns
+from sklearn.utils.testing import assert_raises
 from sklearn.utils.testing import skip_if_32bit
 from sklearn.utils import check_random_state
 from sklearn.manifold.t_sne import _joint_probabilities
@@ -286,11 +288,48 @@ def test_preserve_trustworthiness_approximately_with_precomputed_distances():
                     early_exaggeration=2.0, metric="precomputed",
                     random_state=i, verbose=0)
         X_embedded = tsne.fit_transform(D)
-        t = trustworthiness(D, X_embedded, n_neighbors=1,
-                            precomputed=True)
+        t = trustworthiness(D, X_embedded, n_neighbors=1, metric="precomputed")
         assert t > .95


+def test_trustworthiness_precomputed_deprecation():
+    # NOTE: Remove this test in v0.23
+
+    # Use of the flag `precomputed` in trustworthiness parameters has been
+    # deprecated, but will still work until v0.23.
+    X = np.arange(100).reshape(50, 2)
+    assert_equal(assert_warns(DeprecationWarning, trustworthiness,
+                              pairwise_distances(X), X, precomputed=True), 1.)
+    assert_equal(assert_warns(DeprecationWarning, trustworthiness,
+                              pairwise_distances(X), X, metric='precomputed',
+                              precomputed=True), 1.)
+    # If a 'metric' different from 'precomputed' is specified, but the flag
+    # 'precomputed' is set to True, the flag overwrites the parameter 'metric'
+    # and so the 'precomputed' metric will be used. Indeed, in
+    # http://scikit-learn.org/stable/developers/contributing.html#deprecation,

> **jnothman** (Member) commented on Sep 18, 2017:
>
> I don't think we need this reminder... But yes, this is what we tend to do.

+    # the old parameter overwrites the new parameter.
+    assert_raises(Exception, assert_warns, DeprecationWarning,

> **jnothman** (Member) commented on Sep 18, 2017:
>
> Usually we would not test for raising something as broad as Exception. Indeed, usually we'd check for a particular error message.

+                  trustworthiness, X, X, metric='euclidean', precomputed=True)
+    assert_equal(assert_warns(DeprecationWarning, trustworthiness,
+                              pairwise_distances(X), X, metric='euclidean',
+                              precomputed=True), 1.)
+
+
+def test_trustworthiness_unusual_metric():
+    # Metrics other than 'euclidean' and 'precomputed' are unusual and must
+    # raise a warning
+    X = np.arange(100).reshape(50, 2)
+    assert_warns(RuntimeWarning, trustworthiness, X, X, metric='manhattan')
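The deprecation behaviour these tests pin down — the old flag emits a `DeprecationWarning` and, when set, takes precedence over the new parameter — can be sketched in isolation. Everything below (the `compute` function in particular) is a hypothetical illustration of the pattern, not sklearn code:

```python
import warnings


def compute(X, precomputed=False, metric='euclidean'):
    """Hypothetical function showing the old-flag-wins deprecation pattern."""
    if precomputed:
        warnings.warn("'precomputed' is deprecated; use metric='precomputed' "
                      "instead.", DeprecationWarning)
        metric = 'precomputed'  # the old parameter overwrites the new one
    return metric


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = compute(None, precomputed=True, metric='euclidean')

# The old flag both warned and overrode metric='euclidean'.
assert result == 'precomputed'
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

This matches the convention from the scikit-learn contributing guide cited in the test: during the deprecation window, the old parameter overwrites the new one.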

> **jnothman** (Member) commented on Sep 18, 2017:
>
> I actually strongly believe we should not raise a warning here or make any statement that this is abnormal. Even if deriving the manifold from an unusual metric is weird, checking that neighborhoods are maintained in the embedded space is precisely how we determine whether the learnt manifold was appropriate.

+    # The example below illustrates why using a custom metric other than
+    # euclidean can be misleading, while testing the function in this case.
+    X = np.array([[1, 1], [1, 0], [2, 2]])
+    assert_equal(trustworthiness(X, X, n_neighbors=1, metric='euclidean'), 1.)
+    assert_almost_equal(assert_warns(RuntimeWarning, trustworthiness, X, X,
+                                     n_neighbors=1,
+                                     metric='cosine'),
+                        0.66, decimal=1)

> **jnothman** (Member) commented on Sep 19, 2017:
>
> .66 appears drawn from thin air. Instead, benchmark against the corresponding precomputed distances.
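The disagreement the test exploits can be made concrete: on `[[1, 1], [1, 0], [2, 2]]`, cosine and Euclidean disagree about the nearest neighbor of the first point, which is exactly what drives the trustworthiness below 1. An illustrative numpy check (not part of the PR):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 0.0], [2.0, 2.0]])

# Cosine distance: 1 - cos(angle). [1, 1] and [2, 2] point in the same
# direction, so their cosine distance is 0 despite being far apart.
norm = np.linalg.norm(X, axis=1)
cos_dist = 1.0 - (X @ X.T) / np.outer(norm, norm)
assert np.isclose(cos_dist[0, 2], 0.0)

# Under Euclidean distance, the nearest neighbor of [1, 1] is [1, 0] instead.
eucl = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
assert eucl[0, 1] < eucl[0, 2]

# So the two metrics rank the neighbors of point 0 differently.
assert cos_dist[0, 2] < cos_dist[0, 1]
```

With n=3 and n_neighbors=1, a single neighbor displaced by one rank yields 1 - 2/(3·1·(6-3-1)) = 1 - 1/3 ≈ 0.66, which is where the test's expected value comes from; benchmarking against the precomputed cosine distances, as suggested above, would make that provenance explicit.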



def test_early_exaggeration_too_small():
# Early exaggeration factor must be >= 1.
tsne = TSNE(early_exaggeration=0.99)