
[MRG] Bugfix for not-yet-confirmed issue #12863: arpack returns singular values in ascending order, while descending order was assumed in sklearn #12898

Open
wants to merge 16 commits into main

Conversation

@gykovacs commented Jan 1, 2019

Bugfix for not-yet-confirmed issue #12863: arpack returns singular values in ascending order, while descending order was assumed in sklearn

Reference Issues/PRs

Fixes not-yet-confirmed issue #12863

What does this implement/fix? Explain your changes.

The sklearn code assumed that the scipy wrapper for the ARPACK singular value decomposition returns singular values (and the corresponding vectors) in descending order. This is not the case: the ARPACK documentation (https://www.caam.rice.edu/software/ARPACK/UG/node136.html#SECTION001210000000000000000) clearly states that the singular values are returned in ascending order.

The bug manifests in differing results depending on the solver:

import numpy as np
from sklearn import cluster

A = np.array([[-2, -4, 2], [-2, 1, 2], [4, 2, 5]])

sc = cluster.bicluster.SpectralCoclustering(n_clusters=2,
                                            svd_method='randomized')

sc.fit(A)
print(sc.column_labels_)

gives

[0 0 1]

but

sc = cluster.bicluster.SpectralCoclustering(n_clusters=2,
                                            svd_method='arpack')
sc.fit(A)
print(sc.column_labels_)

gives

[0 0 0]

This bugfix makes the arpack-based call return the same result as the randomized-SVD-based solution.
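
For reference, a minimal sketch of the idea behind the fix (illustrative only, not the exact change made to BaseSpectral._svd): reverse the outputs of scipy.sparse.linalg.svds so that the singular values come in descending order, as assumed by the calling code.

import numpy as np
from scipy.sparse.linalg import svds

A = np.array([[-2, -4, 2], [-2, 1, 2], [4, 2, 5]], dtype=float)

# svds returns the singular values in ascending order (as the ARPACK docs state),
# so the outputs are reversed before being used where descending order is assumed
u, s, vt = svds(A, k=2)
u, s, vt = u[:, ::-1], s[::-1], vt[::-1]
print(s)  # largest singular value first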

Any other comments?

@jnothman (Member) commented Jan 6, 2019

Please add a test.

@jnothman added the Bug label Jan 16, 2019
@gykovacs (Author)
I'm working on it!

@gykovacs (Author)
I have added a test checking whether the labels implied by the singular values computed by the 'arpack' and 'randomized' methods are equal (the singular values themselves are not available on the BaseSpectral objects, so the order of the singular values cannot be compared directly).
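
A sketch of what such a check could look like (illustrative only, not the exact test added to test_bicluster; v_measure_score is used here as an assumption to make the comparison robust to a swapped 0/1 labeling):

import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.metrics import v_measure_score

def test_arpack_and_randomized_labels_agree():
    A = np.array([[-2, -4, 2], [-2, 1, 2], [4, 2, 5]])
    labels = {}
    for method in ('randomized', 'arpack'):
        model = SpectralCoclustering(n_clusters=2, svd_method=method,
                                     random_state=0)
        model.fit(A)
        labels[method] = model.column_labels_
    # identical partitions (up to a label permutation) give a v-measure of exactly 1.0
    assert v_measure_score(labels['randomized'], labels['arpack']) == 1.0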

I also tested the test: it passes with my updates to BaseSpectral._svd but fails without them, so the test matrix seems to be a good choice to drive the testing.

However, one CI job fails: the one related to OSX. Interestingly, a different svd calculation fails there; the numbers do not match the expectations in the 12th digit. To my best understanding this has nothing to do with my bugfix: the mismatch comes from an svd call made directly to scipy at line 163 of decomposition/truncated_svd.py. I'm wondering how to proceed. Shall I maybe decrease the required precision in that test to 11 digits?

@jnothman (Member)
I've not looked into the failure, but I think it must be due to your change, even if your change is right.

@jnothman (Member)
Yes, I think it should be okay to increase the tolerance.

@jnothman (Member)
(The rpca in that test code is a bit weird... both are using 'arpack', not 'randomized')

@gykovacs (Author)
I did some experimentation, let me summarize the results:

  1. the test I added (in test_bicluster) makes a completely different test (test_singular_values in test_truncated_svd.py) fail.

  2. the common point between the two tests is the arpack call

  3. seemingly, the scipy arpack-based svd is stateful

  4. I also tried fixing the test "test_singular_values" by changing one of the algorithms to "randomized"; the result is that some singular values already differ in the first digit after the decimal point:

     apca = TruncatedSVD(n_components=2, algorithm='arpack',
                         random_state=rng).fit(X)
     rpca = TruncatedSVD(n_components=2, algorithm='randomized',
                         random_state=rng).fit(X)

     assert_array_almost_equal(apca.singular_values_, rpca.singular_values_, 12)

E AssertionError:
E Arrays are not almost equal to 12 decimals
E
E (mismatch 100.0%)
E x: array([17.574469336247, 17.470009192113])
E y: array([17.522384124448, 17.359669704603])

My conclusion is that "test_singular_values" needs to be fixed; however, I find changing the required precision to 0 digits too heavy an intervention, even though this difference in precision is what I would expect when a randomized method is involved.

Would it make sense to play around with the parameters (like the size of the sample matrix) and try to find a combination which ensures at least 3-4 matching digits?

@jnothman (Member)
Are you sure it's the change to the test, not the change to the implementation, that resulted in the failure?

@jnothman (Member)
I suspect that when that test comparing rpca to apca was adapted from test_pca, it was seen as inappropriate to compare apca and rpca, and somehow randomized was changed to arpack without thinking it through... not sure though.

@jnothman (Member)
So I don't think we should be testing randomized there, at least not in this PR.

@gykovacs (Author) commented Jan 21, 2019

I did some further checks, let me summarize briefly:

  1. originally, I added a fix for the bug, and all tests passed (https://travis-ci.org/scikit-learn/scikit-learn/builds/474089316?utm_source=github_status&utm_medium=notification)

  2. then, I added a test checking the labels in coclustering (which are implied by the ordering of the singular values), and surprisingly, a different test failed (https://travis-ci.org/scikit-learn/scikit-learn/builds/482023389?utm_source=github_status&utm_medium=notification)

  3. then, I put a SkipException into the test I added (to prevent it from running), and all tests passed again (with my bugfix still part of the package), which suggests that it is not the bugfix but the execution of the additional test that causes the failure of that other test (https://travis-ci.org/scikit-learn/scikit-learn/builds/482182493?utm_source=github_status&utm_medium=notification)

  4. one more observation on why scipy arpack seems to be stateful: as you already mentioned, the failing test is not correct, as rpca uses arpack just like apca. Both objects are initialized with the same random state and parameters; thus, the objects are identical and should give identical results. Contrary to that, the test fails because of a tiny difference in the 12th digit. In my interpretation, there should be no way for the results to differ unless some execution path maintains state or uncontrolled randomness, and my guess is that it comes from scipy, as it is the only common point between the two tests (a small probe of this idea is sketched after this list).
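
To illustrate point 4, a small probe (illustrative only, not code from the PR; the data matrix is a stand-in): fit two identically parameterized TruncatedSVD estimators with the 'arpack' algorithm and print the raw difference of their singular values; anything non-zero hints at state or randomness not controlled by the estimator itself.

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(42)
X = rng.uniform(size=(60, 30))  # stand-in data, not the matrix from test_truncated_svd

apca_1 = TruncatedSVD(n_components=2, algorithm='arpack', random_state=0).fit(X)
apca_2 = TruncatedSVD(n_components=2, algorithm='arpack', random_state=0).fit(X)

# same parameters and same random_state: a non-zero difference would point to
# hidden state or uncontrolled randomness further down the stack
print(np.abs(apca_1.singular_values_ - apca_2.singular_values_).max())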

Now the bugfix is in the code, my test is added, the required precision of the failing (arpack vs arpack) test is decreased to 11 digits, and things seem to work fine: all tests pass. It might be worth reporting the failure of the arpack vs arpack comparison in the truncated SVD test (they should match to any precision) as a separate issue.

Any comments are welcome!

@jnothman (Member) left a comment
Thanks. I've confirmed the test failing at master.

Please add an entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

The arpack state issue might be worth raising separately.

@gykovacs (Author)
Done! Thank you for the guidance! I'll raise the arpack state issue and also try to take it on soon.

@cmarmo (Contributor) commented Mar 25, 2021

Hi @gykovacs , if you are still interested in working on that, do you mind fixing conflicts? Thanks for your work and your patience!

@gykovacs (Author)
Yeah, let me look into it!
