sklearn.cluster.bicluster.BaseSpectral._svd: n_discard eigenvectors from svds #12863

chuanren · 2018-12-26T01:54:14Z

Description

The function sklearn.cluster.bicluster.BaseSpectral._svd incorrectly uses the parameters svd_method = 'arpack' and n_discard.

The function _svd should discard the eigenvectors with largest eigenvalues, but the function svds used when svd_method=='arpack' returns the eigenvectors with ascending eigenvalues. This behavior is different with that of the function randomized_svd used when svd_method=='randomized'.

jnothman · 2018-12-26T05:26:46Z

Thanks for the report.

It would be useful to either present a snippet of code that fails your assumptions or link to the error in the code (https://help.github.com/articles/creating-a-permanent-link-to-a-code-snippet/).

But a pull request fixing the issue, ideally with a test, is also welcome.

chuanren · 2018-12-26T15:45:54Z

Code is presented.

The issue is that: the order of eigenvalues computed in the two conditions (randomized and arpack) are different.

A simple fix would be reorder the eigenvalues after the computation. This will also avoid future errors if more parameter choices are included.

scikit-learn/sklearn/cluster/bicluster.py

Lines 132 to 162 in 8d7e849

    
           if self.svd_method == 'randomized': 
        
               kwargs = {} 
        
               if self.n_svd_vecs is not None: 
        
                   kwargs['n_oversamples'] = self.n_svd_vecs 
        
               u, _, vt = randomized_svd(array, n_components, 
        
                                         random_state=self.random_state, 
        
                                         **kwargs) 
        
           elif self.svd_method == 'arpack': 
        
               u, _, vt = svds(array, k=n_components, ncv=self.n_svd_vecs) 
        
               if np.any(np.isnan(vt)): 
        
                   # some eigenvalues of A * A.T are negative, causing 
        
                   # sqrt() to be np.nan. This causes some vectors in vt 
        
                   # to be np.nan. 
        
                   A = safe_sparse_dot(array.T, array) 
        
                   random_state = check_random_state(self.random_state) 
        
                   # initialize with [-1,1] as in ARPACK 
        
                   v0 = random_state.uniform(-1, 1, A.shape[0]) 
        
                   _, v = eigsh(A, ncv=self.n_svd_vecs, v0=v0) 
        
                   vt = v.T 
        
               if np.any(np.isnan(u)): 
        
                   A = safe_sparse_dot(array, array.T) 
        
                   random_state = check_random_state(self.random_state) 
        
                   # initialize with [-1,1] as in ARPACK 
        
                   v0 = random_state.uniform(-1, 1, A.shape[0]) 
        
                   _, u = eigsh(A, ncv=self.n_svd_vecs, v0=v0) 
        
           assert_all_finite(u) 
        
           assert_all_finite(vt) 
        
           u = u[:, n_discard:] 
        
           vt = vt[n_discard:]

gykovacs · 2019-01-01T13:41:15Z

I can confirm the bug:

import numpy as np
from sklearn import cluster

A= np.array([[-2, -4, 2], [-2, 1, 2], [4, 2, 5]])

sc= cluster.bicluster.SpectralCoclustering(n_clusters= 2, 
                                           svd_method='randomized')

sc.fit(A)
print(sc.column_labels_)

gives

[0 0 1]

, but

sc= cluster.bicluster.SpectralCoclustering(n_clusters= 2, 
                                           svd_method='arpack')
sc.fit(A)
print(sc.column_labels_)

gives

[0 0 0]

Checking the arpack docs it clearly states that the singular values are returned in ascending order.
https://www.caam.rice.edu/software/ARPACK/UG/node136.html#SECTION001210000000000000000

I just submitted a pull request #12898 fixing the issue. #12898

chuanren changed the title ~~sklearn.cluster.bicluster.BaseSpectral._svd n_discard from svds~~ sklearn.cluster.bicluster.BaseSpectral._svd: n_discard eigenvectors from svds Dec 26, 2018

gykovacs mentioned this issue Jan 1, 2019

[MRG] Bugfix for not-yet-confirmed issue #12863: arpack returns singular values in ascending order, the opposite was supposed in sklearn #12898

Open

cmarmo added the module:cluster label Feb 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sklearn.cluster.bicluster.BaseSpectral._svd: n_discard eigenvectors from svds #12863

sklearn.cluster.bicluster.BaseSpectral._svd: n_discard eigenvectors from svds #12863

chuanren commented Dec 26, 2018

jnothman commented Dec 26, 2018

chuanren commented Dec 26, 2018

gykovacs commented Jan 1, 2019

sklearn.cluster.bicluster.BaseSpectral._svd: n_discard eigenvectors from svds #12863

sklearn.cluster.bicluster.BaseSpectral._svd: n_discard eigenvectors from svds #12863

Comments

chuanren commented Dec 26, 2018

Description

jnothman commented Dec 26, 2018

chuanren commented Dec 26, 2018

gykovacs commented Jan 1, 2019