Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sklearn.cluster.bicluster.BaseSpectral._svd: n_discard eigenvectors from svds #12863

Open
chuanren opened this issue Dec 26, 2018 · 3 comments
Open

Comments

@chuanren
Copy link

Description

The function sklearn.cluster.bicluster.BaseSpectral._svd incorrectly uses the parameters svd_method = 'arpack' and n_discard.

The function _svd should discard the eigenvectors with largest eigenvalues, but the function svds used when svd_method=='arpack' returns the eigenvectors with ascending eigenvalues. This behavior is different with that of the function randomized_svd used when svd_method=='randomized'.

@chuanren chuanren changed the title sklearn.cluster.bicluster.BaseSpectral._svd n_discard from svds sklearn.cluster.bicluster.BaseSpectral._svd: n_discard eigenvectors from svds Dec 26, 2018
@jnothman
Copy link
Member

Thanks for the report.

It would be useful to either present a snippet of code that fails your assumptions or link to the error in the code (https://help.github.com/articles/creating-a-permanent-link-to-a-code-snippet/).

But a pull request fixing the issue, ideally with a test, is also welcome.

@chuanren
Copy link
Author

Code is presented.

The issue is that: the order of eigenvalues computed in the two conditions (randomized and arpack) are different.

A simple fix would be reorder the eigenvalues after the computation. This will also avoid future errors if more parameter choices are included.

if self.svd_method == 'randomized':
kwargs = {}
if self.n_svd_vecs is not None:
kwargs['n_oversamples'] = self.n_svd_vecs
u, _, vt = randomized_svd(array, n_components,
random_state=self.random_state,
**kwargs)
elif self.svd_method == 'arpack':
u, _, vt = svds(array, k=n_components, ncv=self.n_svd_vecs)
if np.any(np.isnan(vt)):
# some eigenvalues of A * A.T are negative, causing
# sqrt() to be np.nan. This causes some vectors in vt
# to be np.nan.
A = safe_sparse_dot(array.T, array)
random_state = check_random_state(self.random_state)
# initialize with [-1,1] as in ARPACK
v0 = random_state.uniform(-1, 1, A.shape[0])
_, v = eigsh(A, ncv=self.n_svd_vecs, v0=v0)
vt = v.T
if np.any(np.isnan(u)):
A = safe_sparse_dot(array, array.T)
random_state = check_random_state(self.random_state)
# initialize with [-1,1] as in ARPACK
v0 = random_state.uniform(-1, 1, A.shape[0])
_, u = eigsh(A, ncv=self.n_svd_vecs, v0=v0)
assert_all_finite(u)
assert_all_finite(vt)
u = u[:, n_discard:]
vt = vt[n_discard:]

@gykovacs
Copy link

gykovacs commented Jan 1, 2019

I can confirm the bug:

import numpy as np
from sklearn import cluster

A= np.array([[-2, -4, 2], [-2, 1, 2], [4, 2, 5]])

sc= cluster.bicluster.SpectralCoclustering(n_clusters= 2, 
                                           svd_method='randomized')

sc.fit(A)
print(sc.column_labels_)

gives

[0 0 1]

, but

sc= cluster.bicluster.SpectralCoclustering(n_clusters= 2, 
                                           svd_method='arpack')
sc.fit(A)
print(sc.column_labels_)

gives

[0 0 0]

Checking the arpack docs it clearly states that the singular values are returned in ascending order.
https://www.caam.rice.edu/software/ARPACK/UG/node136.html#SECTION001210000000000000000

I just submitted a pull request #12898 fixing the issue. #12898

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants