
[MRG+1] Initialize ARPACK eigsh #5012

Merged
merged 3 commits into scikit-learn:master from yanlend:patch-2 on Oct 23, 2015

Conversation

yanlend (Contributor) commented Jul 21, 2015

`v0 = random_state.rand(M.shape[0])` leads to an initial residual vector in ARPACK whose entries are all positive. However, this residual is not an absolute or squared quantity but a true signed difference, so it is better to initialize with `randn` so that the signs are equally distributed.

The effect of the previous initialization is that eigsh frequently does not converge to the correct eigenvalues, e.g. negative eigenvalues for an s.p.d. matrix, which lead to an incorrect null space.
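
A minimal sketch of the proposed change (the matrix and seed here are illustrative, not taken from the code base):

import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.RandomState(42)
M = rng.rand(50, 50)
M = np.dot(M.T, M)  # symmetric positive definite test matrix

v0_old = rng.rand(M.shape[0])   # previous init: all entries positive
v0_new = rng.randn(M.shape[0])  # proposed init: signs evenly distributed

# smallest eigenvalues via shift-invert; all should be >= 0 for an s.p.d. matrix
w, _ = eigsh(M, k=3, sigma=0.0, v0=v0_new)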

ogrisel (Member) commented Jul 22, 2015

The appveyor failure is unrelated and now tracked at #5013. If you wish, you can trigger a new build by using:

git commit --amend
git push -f
ogrisel (Member) commented Jul 22, 2015

About the fix itself, I agree that using random positive residuals seems like a bad idea. I checked the source code of ARPACK and it seems that they do it by uniform sampling in the range [-1, 1] instead:

https://github.com/scipy/scipy/blob/master/scipy/sparse/linalg/eigen/arpack/ARPACK/SRC/dgetv0.f#L225

I don't know if it's supposed to be better than the standard normal distribution you are using, but I think we should try to replicate the native ARPACK behavior as much as possible.
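
In numpy terms that would correspond to something like this (a sketch; random_state and M are named as in the scikit-learn call sites):

v0 = random_state.uniform(-1, 1, M.shape[0])  # mirrors dgetv0's [-1, 1] init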

Also do you think you could contribute a non-regression test for this fix?

Finally please add a new entry for this bug fix in doc/whats_new.rst.

ogrisel (Member) commented Jul 22, 2015

Also, we have other occurrences of improperly initialized calls to eigsh in our code base:

git grep "utils.arpack import eigsh"
sklearn/cluster/bicluster.py:from ..utils.arpack import eigsh, svds
sklearn/decomposition/kernel_pca.py:from ..utils.arpack import eigsh
sklearn/manifold/locally_linear.py:from ..utils.arpack import eigsh
sklearn/manifold/spectral_embedding_.py:from ..utils.arpack import eigsh

This gives us the opportunity to fix them all consistently.

ogrisel added the Bug label Jul 22, 2015

ogrisel (Member) commented Jul 22, 2015

About the non-regression test, I was thinking of a test that would highlight the problem in the locally linear embedding model: a new test that fails on the current master but is fixed by the changes in this PR.
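
For instance, something along these lines (a sketch only; the dataset, parameters and tolerance are assumptions, not the test that was eventually merged):

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=100, random_state=0)
for seed in range(5):
    lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                                 eigen_solver='arpack', random_state=seed)
    lle.fit(X)
    # a badly signed v0 can corrupt the null-space computation, which
    # shows up as a large reconstruction error
    assert lle.reconstruction_error_ < 1e-4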

ogrisel (Member) commented Jul 22, 2015

Apparently those changes cause the following test to fail:

======================================================================
FAIL: sklearn.decomposition.tests.test_nmf.test_sparse_transform
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.4/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/decomposition/tests/test_nmf.py", line 166, in test_sparse_transform
    assert_array_almost_equal(A_fit_tr, A_tr, decimal=2)
  File "/home/travis/miniconda/envs/testenv/lib/python3.4/site-packages/numpy/testing/utils.py", line 842, in assert_array_almost_equal
    precision=decimal)
  File "/home/travis/miniconda/envs/testenv/lib/python3.4/site-packages/numpy/testing/utils.py", line 665, in assert_array_compare
    raise AssertionError(msg)
nose.proxy.AssertionError: 
Arrays are not almost equal to 2 decimals
(mismatch 10.0%)
 x: array([[  5.64e-02,   1.02e-01,  -0.00e+00,   5.18e-03],
       [ -0.00e+00,   3.70e-02,   1.36e-01,  -0.00e+00],
       [  2.28e-01,   6.38e-01,  -0.00e+00,  -0.00e+00],...
 y: array([[  5.28e-02,   1.04e-01,  -0.00e+00,   5.57e-03],
       [ -0.00e+00,   1.70e-02,   1.42e-01,  -0.00e+00],
       [  2.13e-01,   6.48e-01,  -0.00e+00,  -0.00e+00],...

I don't see how this is related to the changes of this PR though.

yanlend (Contributor) commented Jul 22, 2015

I see what you mean by the non-regression test. I will try to come up with a test for LocallyLinearEmbedding.

About the failed build: I also could not trace it to any use of ARPACK or anything else related to my changes...

ogrisel (Member) commented Jul 22, 2015

Weird, now the NMF test failure is appearing both on appveyor (Windows) and travis (Linux). I agree, I don't see how this test could possibly be related to changes to ARPACK or FORTRAN RNG seeding in general.

yanlend (Contributor) commented Jul 22, 2015

Well, the failing test contains the comment `# This solver seems pretty inconsistent`, and the accuracy is only checked to 2 decimal places. So maybe it's because different random numbers are chosen now.

ogrisel (Member) commented Jul 23, 2015

So maybe it's because different random numbers are chosen now.

Probably, but I don't see where from reading the source of the NMF model on sparse input data. We want our estimators to be explicitly seedable for reproducibility, which is not the case for this estimator. It would be great to understand how the changes in this PR have impacted the results of the NMF tests.

ogrisel (Member) commented Jul 23, 2015

I have just merged #5020, which seeds this estimator better. You might want to rebase this PR on top of the current master.

A = random_state.rand(50, 50)
A = np.dot(A.T, A) # create s.p.d. matrix
A = graph_laplacian(A)

ogrisel (Member) Jul 23, 2015

Why call graph_laplacian here? The matrix is not guaranteed to be s.p.d. anymore, right? I get the following failure:

======================================================================
FAIL: Non-regression test that shows null-space computation is better with
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/volatile/ogrisel/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/utils/tests/test_utils.py", line 152, in test_arpack_eigsh_initialization
    assert_greater_equal(w[0], 0)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/utils/testing.py", line 126, in assert_greater_equal
    assert a >= b, message
nose.proxy.AssertionError: -430.0444945505198 is not greater than or equal to 0
>>  assert -430.0444945505198 >= 0, '-430.0444945505198 is not greater than or equal to 0'

The test passes when using A = np.dot(A.T, A) directly.

yanlend (Contributor) Jul 23, 2015

graph_laplacian makes the eigenvalue problem more difficult. For an undirected graph, the Laplacian is symmetric positive semi-definite and its first eigenvalue is 0. The issue with the initialization only occurs for difficult cases like this.
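
For reference, the property being relied on (a sketch using scipy's csgraph.laplacian as a stand-in for the graph_laplacian helper used in the test):

import numpy as np
from scipy.sparse.csgraph import laplacian

rng = np.random.RandomState(0)
A = rng.rand(50, 50)
A = np.dot(A.T, A)         # dense symmetric matrix with positive weights
L = laplacian(A)           # symmetric positive semi-definite
w = np.linalg.eigvalsh(L)
print(w[0])                # smallest eigenvalue is ~0, so sigma=0.0 is a hard case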

ogrisel (Member) Jul 23, 2015

Actually, the smallest eigenvalue is negative most of the time on my box:

>>> for i in range(10):
...     print(eigsh(A, k=k, sigma=0.0, v0=np.random.uniform(-1, 1, A.shape[0]))[0][0])
...
-1.13686837722e-15
-146.940587556
-154.837569586
-92.5527859236
-82.614922039
-14.0299334028
-108.269166341
-35.7750459558
-27.0384493697
-42.4412458528

yanlend (Contributor) Jul 23, 2015

This is strange. For me, the eigenvalue is mostly correct (~90% of the time) with uniform, but it mostly fails (~99%) with rand.
Still, the algorithm applied to the Laplacian does not seem stable enough to use in the test.

@@ -157,10 +160,14 @@ def _fit_transform(self, K):
self.lambdas_, self.alphas_ = linalg.eigh(
K, eigvals=(K.shape[0] - n_components, K.shape[0] - 1))
elif eigen_solver == 'arpack':
random_state = check_random_state(self.random_state)
# initialize with [-1,1] as in ARPACK
v0 = random_state.uniform(-1,1, K.shape[0])

ogrisel (Member) Jul 23, 2015

Please follow the PEP8 code style: missing whitespace after comma.

# New initialization [-1,1] (as in original ARPACK)
# Was [0,1] before, with which this test could fail
v0 = random_state.uniform(-1,1, A.shape[0])
w, _ = eigsh(A, k=k, sigma=0.0, v0=v0)

ogrisel (Member) Jul 23, 2015

For this kind of test to be of interest I think we should run it many times with different seeds:

for seed in range(30):
    random_state = check_random_state(seed)
    # put the content of the test here

However we need to make sure that it is still fast enough to run (e.g. < 100ms).

ogrisel (Member) Oct 23, 2015

This comment was not addressed.

amueller (Member) commented Jul 23, 2015

I haven't looked into the issue (sorry, travelling), but if this is a problem with eigsh we should implement a wrapper with a fix and talk to scipy about fixing it upstream.

ogrisel (Member) commented Jul 23, 2015

The failing doctest on travis is caused by a missing random_state=None in doc/modules/pipeline.rst at line 154.

ogrisel (Member) commented Jul 23, 2015

I haven't looked into the issue (sorry, travelling), but if this is a problem with eigsh we should implement a wrapper with a fix and talk to scipy about fixing it upstream.

Maybe we could add a random_state=None parameter to eigsh that does the uniform [-1, 1] random init with a numpy RNG instead of ARPACK's FORTRAN RNG, and we could contribute that to scipy. I am not proficient enough in FORTRAN to be sure that this is 100% equivalent to the default FORTRAN-based seeding, though.
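
Something like this, perhaps (a rough sketch; the name seeded_eigsh and its signature are hypothetical, not an existing scipy or scikit-learn API):

import scipy.sparse.linalg as spla
from sklearn.utils import check_random_state

def seeded_eigsh(A, k=6, random_state=None, **kwargs):
    # hypothetical wrapper: reproduce ARPACK's uniform [-1, 1] starting
    # vector with a seedable numpy RNG instead of the FORTRAN one
    random_state = check_random_state(random_state)
    kwargs.setdefault('v0', random_state.uniform(-1, 1, A.shape[0]))
    return spla.eigsh(A, k=k, **kwargs)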

ogrisel (Member) commented Jul 23, 2015

@yanlend can you please squash all your commits and rebase on top of current master?

yanlend (Contributor) commented Jul 23, 2015

@ogrisel done!

ogrisel (Member) commented Jul 23, 2015

Weird, one of the Windows tests has failed with:

======================================================================
FAIL: Non-regression test that shows null-space computation is better with
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27\lib\site-packages\sklearn\utils\tests\test_utils.py", line 152, in test_arpack_eigsh_initialization
    assert_greater_equal(w[0], 0)
  File "C:\Python27\lib\site-packages\sklearn\utils\testing.py", line 126, in assert_greater_equal
    assert a >= b, message
AssertionError: -127.75430418558309 is not greater than or equal to 0
----------------------------------------------------------------------

I wonder if it's random or not. Will try to relaunch this on appveyor. As the seed is fixed we should get the same failure.

BTW please fix the failing doctest in pipeline.rst as reported by travis. You just need to update the doctest to add a random_state=None in the text representation of that estimator.

ogrisel (Member) commented Jul 23, 2015

You can ignore the OneClassSVM segfault, which is unrelated. However, the test on non-negative eigenvalues is seemingly randomly failing under Windows with 32-bit Python. Maybe the ARPACK solver is just not stable enough on 32-bit Python despite the deterministic seeding, for some reason I do not quite understand...

@@ -126,6 +130,28 @@ def test_pinvh_simple_complex():
assert_almost_equal(np.dot(a, a_pinv), np.eye(3))
def test_arpack_eigsh_initialization():
'''

amueller (Member) Jul 30, 2015

We prefer to start tests with comments, not docstrings, so we can see the name of the test that is failing.

amueller (Member) commented Jul 30, 2015

Maybe we could add a random_state=None parameter to eigsh that does the uniform [-1, 1] random init with a numpy RNG instead of ARPACK's FORTRAN RNG, and we could contribute that to scipy. I am not proficient enough in FORTRAN to be sure that this is 100% equivalent to the default FORTRAN-based seeding, though.

I'm not sure I understand the issue then. I thought scipy was using v0 = random_state.rand(M.shape[0]) by default, while ARPACK is using a uniform [-1, 1] init. That seems like a bug in scipy.

yanlend (Contributor) commented Jul 31, 2015

If no initial v0 is passed to scipy, it does no initialization itself but lets ARPACK handle it. However, ARPACK does not use random_state, so the behavior is not deterministic.

The main issue is that in some places in sklearn, v0 was passed to eigsh drawn from the interval [0, 1] instead of [-1, 1]. This led to worse convergence in some cases.

amueller (Member) commented Jul 31, 2015

ok, got it now. Thanks for the explanation.

amueller (Member) commented Sep 9, 2015

What is the status of this?

ogrisel referenced this pull request Oct 5, 2015

Merged

[MRG+3] Collapsing PCA and RandomizedPCA #5299

8 of 8 tasks complete
ogrisel (Member) commented Oct 5, 2015

@giorgiop, as @yanlend does not seem to have the time to finish this, it would be great if you could pick it up.

ogrisel (Member) commented Oct 5, 2015

In particular, this PR needs a rebase (and optionally a squash) on top of current master.

yanlend (Contributor) commented Oct 19, 2015

@ogrisel I did the rebase.

rt.fit(X_train, y_train)
rt_lm.fit(SelectFromModel(rt, prefit=True).transform(X_train_lr), y_train_lr)
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
random_state=0)

GaelVaroquaux (Member) Oct 19, 2015

@yanlend: are these changes supposed to be in your PR? They look strange to me and somewhat unrelated.

@@ -38,9 +37,7 @@
# use RandomTreesEmbedding to transform data
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
hasher.fit(X)
model = SelectFromModel(hasher, prefit=True)
X_transformed = model.transform(X)

GaelVaroquaux (Member) Oct 19, 2015

Same remark here: it seems that these changes are unrelated to ARPACK initialization. I don't understand.

GaelVaroquaux (Member) commented Oct 19, 2015

I wonder whether there are merge artifacts in the current state of the PR.

yanlend (Contributor) commented Oct 19, 2015

@GaelVaroquaux sorry about that, I somehow messed up the rebase. It should be fixed now.

GaelVaroquaux changed the title from Initialize ARPACK eigsh to [MRG+1] Initialize ARPACK eigsh Oct 19, 2015

GaelVaroquaux (Member) commented Oct 19, 2015

Travis is not running on this guy. I don't understand why.

GaelVaroquaux (Member) commented Oct 19, 2015

@yanlend: I am sorry, there are conflicts again. This bothers me, as I would like to have the tests run before merging.

Can you rebase once again? I am sorry; this is happening because there is a sprint and hence a lot of activity on the project.

yanlend (Contributor) commented Oct 20, 2015

I don't think the failed tests of PCA whitening are related to this PR. However, the regression test test_arpack_eigsh_initialization() also fails sometimes. The initialization proposed here is more stable than the previous one, but it does not guarantee that eigsh will converge. That's why I think it is not a good idea to include that test.

GaelVaroquaux (Member) commented Oct 21, 2015

The failing PCA whitening tests don't fail on master, and they seem legitimate to me.

For the other one, I am still thinking about the right way to test this.

giorgiop (Contributor) commented Oct 21, 2015

That test was actually not deterministic; I fixed it in #5141.

Can you please rebase on master and push again?

GaelVaroquaux (Member) commented Oct 21, 2015

Regarding the failing test (not the whitening one): how about we add 1e-7 on the diagonal of the matrix? This should make the test slightly easier, and I would hope that it passes.
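
In code, that suggestion would be something like (a sketch on the kind of matrix used in the test):

import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(50, 50)
A = np.dot(A.T, A)                # s.p.d. test matrix
# shift the smallest eigenvalue slightly away from zero
A.flat[::A.shape[0] + 1] += 1e-7  # add 1e-7 on the diagonal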

yanlend added some commits Jul 21, 2015

Initialize ARPACK eigsh
`v0 = random_state.rand(M.shape[0])` leads to an initial residual vector in ARPACK whose entries are all positive. However, this residual is not an absolute or squared quantity but a true signed difference, so it is better to initialize with `v0 = random_state.uniform(-1, 1, M.shape[0])` so that the signs are equally distributed. This is the way ARPACK initializes the residuals.
The effect of the previous initialization is that eigsh frequently does not converge to the correct eigenvalues, e.g. negative eigenvalues for an s.p.d. matrix, which lead to an incorrect null space.

- initialized all occurrences of sklearn.utils.arpack.eigsh the same way they would be initialized by ARPACK
- added a regression test for the behavior of the new initialization
Update test_utils.py
Comment instead of docstring to follow sklearn convention.
Regression test made easier by adding small value to main diagonal of s.p.d. matrix
yanlend (Contributor) commented Oct 23, 2015

I did the rebase and added the suggestion of @GaelVaroquaux.

giorgiop (Contributor) commented Oct 23, 2015

  • Travis passed
GaelVaroquaux (Member) commented Oct 23, 2015

OK, I am merging. Good job!

GaelVaroquaux added a commit that referenced this pull request Oct 23, 2015

Merge pull request #5012 from yanlend/patch-2
[MRG+1] Initialize ARPACK eigsh

GaelVaroquaux merged commit 2492dd0 into scikit-learn:master Oct 23, 2015

1 of 2 checks passed

continuous-integration/appveyor: Waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr: The Travis CI build passed

yanlend deleted the yanlend:patch-2 branch Oct 23, 2015

giorgiop (Contributor) commented Oct 23, 2015

Do you want to squash those commits @GaelVaroquaux ?
