Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 1] enable metric = 'cosine' for tsne computation #9623

Merged
merged 10 commits into from Oct 2, 2017
5 changes: 1 addition & 4 deletions sklearn/manifold/t_sne.py
Expand Up @@ -711,10 +711,7 @@ def _fit(self, X, skip_num_points=0):
print("[t-SNE] Computing {} nearest neighbors...".format(k))

# Find the nearest neighbors for every point
neighbors_method = 'ball_tree'
if (self.metric == 'precomputed'):
neighbors_method = 'brute'
knn = NearestNeighbors(algorithm=neighbors_method, n_neighbors=k,
knn = NearestNeighbors(algorithm='auto', n_neighbors=k,
metric=self.metric)
t0 = time()
knn.fit(X)
Expand Down
20 changes: 20 additions & 0 deletions sklearn/manifold/tests/test_t_sne.py
Expand Up @@ -30,6 +30,8 @@
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.metrics.pairwise import cosine_distances


x = np.linspace(0, 1, 10)
Expand Down Expand Up @@ -764,3 +766,21 @@ def test_bh_match_exact():
assert n_iter['exact'] == n_iter['barnes_hut']
assert_array_almost_equal(X_embeddeds['exact'], X_embeddeds['barnes_hut'],
decimal=3)


def test_tsne_with_different_distance_metrics():
"""Make sure that TSNE works for different distance metrics"""
random_state = check_random_state(0)
n_components_original = 3
n_components_embedding = 2
X = random_state.randn(50, n_components_original).astype(np.float32)
metrics = ['manhattan', 'cosine']
dist_funcs = [manhattan_distances, cosine_distances]
for metric, dist_func in zip(metrics, dist_funcs):
tsne_1 = TSNE(metric=metric, n_components=n_components_embedding,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan of variable lazy naming, even less when they are confusing like this (tsne_1 makes you think this is TSNE estimator while it is a transformed X ...). You could do

X_transformed_tsne = TSNE(...).fit_transform(...)
X_transformed_tsne_precomputed = TSNE(...).fit_transform(...)

random_state=0).fit_transform(X)
tsne_2 = TSNE(metric='precomputed',
n_components=n_components_embedding,
random_state=0).fit_transform(dist_func(X))
t = trustworthiness(tsne_1, tsne_2, n_neighbors=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about what you're testing here. I think we can just test that the outputs are identical if the second is precomputed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trustworthiness is designed to test the validity of the results, between the input and the output space, using the euclidean distance.
You should change this test to assert that tsne_1 and tsne_2 are the same, .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

assert_greater(t, 0.9)