Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Issue 1453: MDS: fall back to SVD when the full similarity matrix is known #3141

Open
wants to merge 2 commits into from

2 participants

Florian Wilhelm Olivier Grisel
Florian Wilhelm

The SVD method can now be used to calculate the metric MDS. In order to do this a new argument method which defaults to smacof and can be set to svd was introduced in the MDS estimator.
This was done in order to fix issue #1453.

Olivier Grisel ogrisel commented on the diff
sklearn/manifold/mds.py
((16 lines not shown))
+ """
+
+ similarities, = check_arrays(similarities, sparse_format='dense')
+ n_samples = similarities.shape[0]
+
+ if similarities.shape[0] != similarities.shape[1]:
+ raise ValueError("similarities must be a square array (shape=%d)" %
+ n_samples)
+ if not np.allclose(similarities, similarities.T):
+ raise ValueError("similarities must be symmetric")
+
+ H = np.eye(*similarities.shape) - 1./n_samples*np.ones(similarities.shape)
+ K = -0.5*np.dot(H, np.dot(similarities**2, H))
+ w, V = np.linalg.eig(K)
+ # Sort eigenvalues and eigenvectors in decreasing order
+ ix = np.argsort(w)[::-1]
Olivier Grisel Owner
ogrisel added a note

w is already sorted by default, no?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Olivier Grisel ogrisel commented on the diff
sklearn/manifold/mds.py
((14 lines not shown))
+ number of dimension in which to immerse the similarities
+ overridden if initial array is provided.
+ """
+
+ similarities, = check_arrays(similarities, sparse_format='dense')
+ n_samples = similarities.shape[0]
+
+ if similarities.shape[0] != similarities.shape[1]:
+ raise ValueError("similarities must be a square array (shape=%d)" %
+ n_samples)
+ if not np.allclose(similarities, similarities.T):
+ raise ValueError("similarities must be symmetric")
+
+ H = np.eye(*similarities.shape) - 1./n_samples*np.ones(similarities.shape)
+ K = -0.5*np.dot(H, np.dot(similarities**2, H))
+ w, V = np.linalg.eig(K)
Olivier Grisel Owner
ogrisel added a note

Better use the linalg package from scipy: scipy has a mandatory requirement on an optimized BLAS / LAPACK. This is not the case fo numpy that can therefore be much slower.

Olivier Grisel Owner
ogrisel added a note

BTW: I don't understand why the method is called SVD if we don't internally use an SVD solver but instead use an eigen decomposition solver. Do you have a reference that motivates this way of implementing the "svd method for MDS"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Olivier Grisel ogrisel commented on the diff
sklearn/manifold/tests/test_mds.py
@@ -59,3 +61,63 @@ def test_MDS():
[4, 2, 1, 0]])
mds_clf = mds.MDS(metric=False, n_jobs=3, dissimilarity="precomputed")
mds_clf.fit(sim)
+
+
+def test_svd_mds():
+ # Generate 4 randomly chosen points
+ Y = np.array([[1, 0, 1],
+ [-1, 3, 2],
+ [1, -2, 3],
+ [2, -1, -3]])
+ sim = euclidean_distances(Y)
+ # calculate error or smacof-based solution
+ X_smacof, smacof_stress = mds.smacof(sim, n_components=2, random_state=42)
+ X_svd, svd_stress = mds.svd_mds(sim, n_components=2)
+ assert_less(svd_stress, smacof_stress)
Olivier Grisel Owner
ogrisel added a note

Can you please put an inline comment that explains why this should always be the case?

Olivier Grisel Owner
ogrisel added a note

Should X_smacof and X_svd be approximately be equal up to a sign flip? If so you could check the equality of the component wise absolute values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Olivier Grisel ogrisel commented on the diff
sklearn/manifold/mds.py
((23 lines not shown))
+ n_samples)
+ if not np.allclose(similarities, similarities.T):
+ raise ValueError("similarities must be symmetric")
+
+ H = np.eye(*similarities.shape) - 1./n_samples*np.ones(similarities.shape)
+ K = -0.5*np.dot(H, np.dot(similarities**2, H))
+ w, V = np.linalg.eig(K)
+ # Sort eigenvalues and eigenvectors in decreasing order
+ ix = np.argsort(w)[::-1]
+ w = w[ix]
+ V = V[:, ix]
+ if not np.all(w >= -1e-12):
+ raise ValueError("similarities must be euclidean")
+ X = np.sqrt(w[:n_components])*V[:, :n_components]
+ dists = euclidean_distances(X)
+ stress = ((similarities.ravel() - dists.ravel()) ** 2).sum() / 2
Olivier Grisel Owner
ogrisel added a note

You should use sklearn.utils.extmath.norm instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Olivier Grisel ogrisel commented on the diff
sklearn/manifold/mds.py
@@ -257,6 +257,44 @@ def smacof(similarities, metric=True, n_components=2, init=None, n_init=8,
return best_pos, best_stress
+def svd_mds(similarities, n_components=2):
+ """
+ Computes multidimensional scaling using SVD algorithm
+
+ Parameters
+ ----------
+ similarities : symmetric ndarray, shape (n_samples, n_samples)
+ similarities between the points
+
+ n_components : int, optional, default: 2
+ number of dimension in which to immerse the similarities
+ overridden if initial array is provided.
+ """
+
+ similarities, = check_arrays(similarities, sparse_format='dense')
Olivier Grisel Owner
ogrisel added a note

This will have to rebased on top of master and replaced by a call to check_array, see:

http://scikit-learn.org/stable/developers/utilities.html#validation-tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on May 8, 2014
  1. Florian Wilhelm
Commits on May 10, 2014
  1. Florian Wilhelm
This page is out of date. Refresh to see the latest.
Showing with 120 additions and 7 deletions.
  1. +58 −7 sklearn/manifold/mds.py
  2. +62 −0 sklearn/manifold/tests/test_mds.py
65 sklearn/manifold/mds.py
View
@@ -257,6 +257,44 @@ def smacof(similarities, metric=True, n_components=2, init=None, n_init=8,
return best_pos, best_stress
+def svd_mds(similarities, n_components=2):
+ """
+ Computes multidimensional scaling using SVD algorithm
+
+ Parameters
+ ----------
+ similarities : symmetric ndarray, shape (n_samples, n_samples)
+ similarities between the points
+
+ n_components : int, optional, default: 2
+ number of dimension in which to immerse the similarities
+ overridden if initial array is provided.
+ """
+
+ similarities, = check_arrays(similarities, sparse_format='dense')
Olivier Grisel Owner
ogrisel added a note

This will have to rebased on top of master and replaced by a call to check_array, see:

http://scikit-learn.org/stable/developers/utilities.html#validation-tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
+ n_samples = similarities.shape[0]
+
+ if similarities.shape[0] != similarities.shape[1]:
+ raise ValueError("similarities must be a square array (shape=%d)" %
+ n_samples)
+ if not np.allclose(similarities, similarities.T):
+ raise ValueError("similarities must be symmetric")
+
+ H = np.eye(*similarities.shape) - 1./n_samples*np.ones(similarities.shape)
+ K = -0.5*np.dot(H, np.dot(similarities**2, H))
+ w, V = np.linalg.eig(K)
Olivier Grisel Owner
ogrisel added a note

Better use the linalg package from scipy: scipy has a mandatory requirement on an optimized BLAS / LAPACK. This is not the case fo numpy that can therefore be much slower.

Olivier Grisel Owner
ogrisel added a note

BTW: I don't understand why the method is called SVD if we don't internally use an SVD solver but instead use an eigen decomposition solver. Do you have a reference that motivates this way of implementing the "svd method for MDS"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
+ # Sort eigenvalues and eigenvectors in decreasing order
+ ix = np.argsort(w)[::-1]
Olivier Grisel Owner
ogrisel added a note

w is already sorted by default, no?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
+ w = w[ix]
+ V = V[:, ix]
+ if not np.all(w >= -1e-12):
+ raise ValueError("similarities must be euclidean")
+ X = np.sqrt(w[:n_components])*V[:, :n_components]
+ dists = euclidean_distances(X)
+ stress = ((similarities.ravel() - dists.ravel()) ** 2).sum() / 2
Olivier Grisel Owner
ogrisel added a note

You should use sklearn.utils.extmath.norm instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
+ return X, stress
+
+
class MDS(BaseEstimator):
"""Multidimensional scaling
@@ -266,6 +304,10 @@ class MDS(BaseEstimator):
compute metric or nonmetric SMACOF (Scaling by Majorizing a
Complicated Function) algorithm
+ method: string, optional, default: smacof
+ Methods for solving the MDS problem are "smacof" and "svd".
+ If SVD is used, the parameter ``metric`` must be set to True.
+
n_components : int, optional, default: 2
number of dimension in which to immerse the similarities
overridden if initial array is provided.
@@ -328,10 +370,12 @@ class MDS(BaseEstimator):
"""
def __init__(self, n_components=2, metric=True, n_init=4,
max_iter=300, verbose=0, eps=1e-3, n_jobs=1,
- random_state=None, dissimilarity="euclidean"):
+ random_state=None, dissimilarity="euclidean",
+ method="smacof"):
self.n_components = n_components
self.dissimilarity = dissimilarity
self.metric = metric
+ self.method = method
self.n_init = n_init
self.max_iter = max_iter
self.eps = eps
@@ -387,10 +431,17 @@ def fit_transform(self, X, init=None, y=None):
raise ValueError("Proximity must be 'precomputed' or 'euclidean'."
" Got %s instead" % str(self.dissimilarity))
- self.embedding_, self.stress_ = smacof(
- self.dissimilarity_matrix_, metric=self.metric,
- n_components=self.n_components, init=init, n_init=self.n_init,
- n_jobs=self.n_jobs, max_iter=self.max_iter, verbose=self.verbose,
- eps=self.eps, random_state=self.random_state)
-
+ if self.method == "smacof":
+ self.embedding_, self.stress_ = smacof(
+ self.dissimilarity_matrix_, metric=self.metric,
+ n_components=self.n_components, init=init, n_init=self.n_init,
+ n_jobs=self.n_jobs, max_iter=self.max_iter, eps=self.eps,
+ verbose=self.verbose, random_state=self.random_state)
+ elif self.method == "svd":
+ if not self.metric:
+ raise ValueError("Using SVD requires metric=True")
+ self.embedding_, self.stress_ = svd_mds(
+ self.dissimilarity_matrix_, n_components=self.n_components)
+ else:
+ raise ValueError("Method %s is unknown" % str(self.method))
return self.embedding_
62 sklearn/manifold/tests/test_mds.py
View
@@ -3,6 +3,8 @@
from nose.tools import assert_raises
from sklearn.manifold import mds
+from sklearn.metrics import euclidean_distances
+from sklearn.utils.testing import assert_less
def test_smacof():
@@ -59,3 +61,63 @@ def test_MDS():
[4, 2, 1, 0]])
mds_clf = mds.MDS(metric=False, n_jobs=3, dissimilarity="precomputed")
mds_clf.fit(sim)
+
+
+def test_svd_mds():
+ # Generate 4 randomly chosen points
+ Y = np.array([[1, 0, 1],
+ [-1, 3, 2],
+ [1, -2, 3],
+ [2, -1, -3]])
+ sim = euclidean_distances(Y)
+ # calculate error or smacof-based solution
+ X_smacof, smacof_stress = mds.smacof(sim, n_components=2, random_state=42)
+ X_svd, svd_stress = mds.svd_mds(sim, n_components=2)
+ assert_less(svd_stress, smacof_stress)
Olivier Grisel Owner
ogrisel added a note

Can you please put an inline comment that explains why this should always be the case?

Olivier Grisel Owner
ogrisel added a note

Should X_smacof and X_svd be approximately be equal up to a sign flip? If so you could check the equality of the component wise absolute values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
+
+
+def test_svd_mds_non_square_matrix():
+ sim = np.ones((3, 4))
+ assert_raises(ValueError, mds.svd_mds, sim)
+
+
+def test_svd_mds_non_symmetric_matrix():
+ sim = np.ones((3, 3))
+ sim[0, 1] = 0
+ assert_raises(ValueError, mds.svd_mds, sim)
+
+
+def test_svd_mds_non_euclidean():
+ # non euclidean similarities (no triangular inequality)
+ sim = np.array([[0, 12, 3, 4],
+ [12, 0, 2, 2],
+ [3, 2, 0, 1],
+ [4, 2, 1, 0]])
+ assert_raises(ValueError, mds.svd_mds, sim)
+
+
+def test_MDS_svd():
+ Y = np.array([[1, 0, 1],
+ [-1, 3, 2],
+ [1, -2, 3],
+ [2, -1, -3]])
+ mds_clf = mds.MDS(metric=True, method="svd")
+ mds_clf.fit(Y)
+
+
+def test_MDS_svd_non_metric():
+ Y = np.array([[1, 0, 1],
+ [-1, 3, 2],
+ [1, -2, 3],
+ [2, -1, -3]])
+ mds_clf = mds.MDS(metric=False, method="svd")
+ assert_raises(ValueError, mds_clf.fit, Y)
+
+
+def test_MDS_unknown_method():
+ Y = np.array([[1, 0, 1],
+ [-1, 3, 2],
+ [1, -2, 3],
+ [2, -1, -3]])
+ mds_clf = mds.MDS(metric=True, method="unknown_method")
+ assert_raises(ValueError, mds_clf.fit, Y)
Something went wrong with that request. Please try again.