
[MRG+1] Initialize ARPACK eigsh #5012

Merged
merged 3 commits into scikit-learn:master from yanlend:patch-2 on Oct 23, 2015
Conversation

yanlend
Contributor

@yanlend yanlend commented Jul 21, 2015

`v0 = random_state.rand(M.shape[0])` leads to an initial residual vector in ARPACK that is all positive. However, this is not an absolute or squared residual, but a true difference. Thus, it is better to initialize with randn to get evenly distributed signs.

The effect of the previous initialization is that eigsh frequently does not converge to the correct eigenvalues, e.g. it returns negative eigenvalues for an s.p.d. matrix, which leads to an incorrect null-space.
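A minimal sketch of the difference between the two initializations (variable names are illustrative; the merged fix ultimately used uniform sampling in [-1, 1], matching ARPACK's own default, rather than randn):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.RandomState(42)
M = rng.rand(50, 50)
M = np.dot(M.T, M)  # symmetric positive definite test matrix

# Previous initialization: every entry of v0 is positive.
v0_old = rng.rand(M.shape[0])

# Proposed initialization: signs are distributed evenly.
v0_new = rng.uniform(-1, 1, M.shape[0])

# ARPACK starts its iteration from the user-supplied residual vector v0.
w, _ = eigsh(M, k=3, v0=v0_new)
```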

@ogrisel
Member

ogrisel commented Jul 22, 2015

The appveyor failure is unrelated and now tracked at #5013. If you wish, you can trigger a new build by using:

git commit --amend
git push -f

@ogrisel
Member

ogrisel commented Jul 22, 2015

About the fix itself, I agree that using random positive residuals seems like a bad idea. I checked the source code of ARPACK and it seems that they do it by uniform sampling in the range [-1, 1] instead:

https://github.com/scipy/scipy/blob/master/scipy/sparse/linalg/eigen/arpack/ARPACK/SRC/dgetv0.f#L225

I don't know if it's supposed to be better than the standard normal distribution you are using, but I think we should replicate the native ARPACK behavior as closely as possible.

Also do you think you could contribute a non-regression test for this fix?

Finally please add a new entry for this bug fix in doc/whats_new.rst.

@ogrisel
Member

ogrisel commented Jul 22, 2015

Also we have other occurrences of not properly initialized calls to eigsh in our code base:

git grep "utils.arpack import eigsh"
sklearn/cluster/bicluster.py:from ..utils.arpack import eigsh, svds
sklearn/decomposition/kernel_pca.py:from ..utils.arpack import eigsh
sklearn/manifold/locally_linear.py:from ..utils.arpack import eigsh
sklearn/manifold/spectral_embedding_.py:from ..utils.arpack import eigsh

This gives us the opportunity to fix them all consistently.

@ogrisel ogrisel added the Bug label Jul 22, 2015
@ogrisel
Member

ogrisel commented Jul 22, 2015

About the non-regression test I was thinking about a test that would highlight the problem in the locally linear embedding model: some new test that would fail on the current master but that would be fixed by the changes in this PR.

@ogrisel
Member

ogrisel commented Jul 22, 2015

Apparently those changes cause the following test to fail:

======================================================================
FAIL: sklearn.decomposition.tests.test_nmf.test_sparse_transform
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.4/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/decomposition/tests/test_nmf.py", line 166, in test_sparse_transform
    assert_array_almost_equal(A_fit_tr, A_tr, decimal=2)
  File "/home/travis/miniconda/envs/testenv/lib/python3.4/site-packages/numpy/testing/utils.py", line 842, in assert_array_almost_equal
    precision=decimal)
  File "/home/travis/miniconda/envs/testenv/lib/python3.4/site-packages/numpy/testing/utils.py", line 665, in assert_array_compare
    raise AssertionError(msg)
nose.proxy.AssertionError: 
Arrays are not almost equal to 2 decimals
(mismatch 10.0%)
 x: array([[  5.64e-02,   1.02e-01,  -0.00e+00,   5.18e-03],
       [ -0.00e+00,   3.70e-02,   1.36e-01,  -0.00e+00],
       [  2.28e-01,   6.38e-01,  -0.00e+00,  -0.00e+00],...
 y: array([[  5.28e-02,   1.04e-01,  -0.00e+00,   5.57e-03],
       [ -0.00e+00,   1.70e-02,   1.42e-01,  -0.00e+00],
       [  2.13e-01,   6.48e-01,  -0.00e+00,  -0.00e+00],...

I don't see how this is related to the changes of this PR though.

@yanlend
Contributor Author

yanlend commented Jul 22, 2015

I see what you mean by the non-regression test. I will try to come up with a test for LocallyLinearEmbedding.

About the failed build: I also could not trace it to any use of ARPACK or anything else related to my changes...

@ogrisel
Member

ogrisel commented Jul 22, 2015

Weird, now the NMF test failure is appearing both on appveyor (Windows) and travis (Linux). I agree, I don't see how this test could possibly be related to changes in ARPACK or Fortran RNG seeding in general.

@yanlend
Contributor Author

yanlend commented Jul 22, 2015

Well, the failing test itself says `# This solver seems pretty inconsistent` and the accuracy is only set to 2 decimal places.
So maybe it's because different random numbers are chosen now.

@ogrisel
Member

ogrisel commented Jul 23, 2015

So maybe it's because of a different random number chosen now.

Probably, but I don't see where, from reading the source of the NMF model on sparse input data. We want our estimators to be explicitly seedable for reproducibility, which is not the case for this estimator. It would be great to understand how the changes in this PR have impacted the results of the NMF tests.

@ogrisel
Member

ogrisel commented Jul 23, 2015

I have just merged #5020 that better seeds this estimator. You might want to rebase this PR on top of the current master.


A = random_state.rand(50, 50)
A = np.dot(A.T, A) # create s.p.d. matrix
A = graph_laplacian(A)
Member

Why call graph_laplacian here? This is not guaranteed to be s.p.d. any more, right? I get the following failure:

======================================================================
FAIL: Non-regression test that shows null-space computation is better with
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/volatile/ogrisel/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/utils/tests/test_utils.py", line 152, in test_arpack_eigsh_initialization
    assert_greater_equal(w[0], 0)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/utils/testing.py", line 126, in assert_greater_equal
    assert a >= b, message
nose.proxy.AssertionError: -430.0444945505198 is not greater than or equal to 0
>>  assert -430.0444945505198 >= 0, '-430.0444945505198 is not greater than or equal to 0'

The test passes when using A = np.dot(A.T, A) directly.

Contributor Author

graph_laplacian makes the eigenvalue problem more difficult. For an undirected graph, the Laplacian is s.p.d. and its first eigenvalue is 0. The issue with the initialization only occurs in such difficult cases.
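For reference, a sketch of this construction, using scipy's csgraph laplacian in place of sklearn's internal graph_laplacian helper (an assumption; the PR's test uses the sklearn utility):

```python
import numpy as np
from scipy.sparse.csgraph import laplacian

rng = np.random.RandomState(42)
A = rng.rand(50, 50)
A = np.dot(A.T, A)  # symmetric with positive entries: a dense affinity graph
L = laplacian(A)

# Rows of a graph Laplacian sum to zero, so the constant vector lies in the
# null-space and the smallest eigenvalue is exactly 0 -- a hard target for
# an eigsh call seeded with a poor v0.
ones = np.ones(L.shape[0])
```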

Member

ok.

Member

Actually the smallest eigenvalue is negative most of the time on my box:

>>> for i in range(10):
...     print(eigsh(A, k=k, sigma=0.0, v0=np.random.uniform(-1, 1, A.shape[0]))[0][0])
...
-1.13686837722e-15
-146.940587556
-154.837569586
-92.5527859236
-82.614922039
-14.0299334028
-108.269166341
-35.7750459558
-27.0384493697
-42.4412458528

Contributor Author

This is strange. For me, the eigenvalue is mostly correct (~90% of runs) with uniform, but it mostly fails (~99%) with rand.
Still, the algorithm applied to the Laplacian does not seem stable enough to use in the test.

@amueller
Member

I haven't looked into the issue (sorry travelling) but if this is a problem with eigsh we should implement a wrapper with a fix and talk to scipy about fixing upstream.

@ogrisel
Member

ogrisel commented Jul 23, 2015

The failing doctest on travis is caused by a missing random_state=None in doc/modules/pipeline.rst at line 154.

@ogrisel
Member

ogrisel commented Jul 23, 2015

I haven't looked into the issue (sorry travelling) but if this is a problem with eigsh we should implement a wrapper with a fix and talk to scipy about fixing upstream.

Maybe we could add a random_state=None parameter to eigsh that does the uniform [-1, 1] random init with a numpy RNG, without using the ARPACK Fortran RNG, and we could contribute that to scipy. I am not proficient enough in Fortran to be sure that this is 100% equivalent to using the default Fortran-based seeding, though.
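A rough sketch of that proposal (the name seeded_eigsh and its signature are illustrative, not an actual scipy API):

```python
import numpy as np
import scipy.sparse.linalg
from sklearn.utils import check_random_state

def seeded_eigsh(A, k=6, random_state=None, **kwargs):
    # Seed v0 with a numpy RNG instead of ARPACK's internal Fortran
    # seeding, keeping ARPACK's uniform [-1, 1] distribution.
    rng = check_random_state(random_state)
    kwargs.setdefault("v0", rng.uniform(-1, 1, A.shape[0]))
    return scipy.sparse.linalg.eigsh(A, k=k, **kwargs)

A = np.random.RandomState(0).rand(30, 30)
A = np.dot(A.T, A)
w1, _ = seeded_eigsh(A, k=2, random_state=0)
w2, _ = seeded_eigsh(A, k=2, random_state=0)  # same seed, same v0
```

With a fixed seed the same v0 is drawn on every call, so the result is reproducible regardless of ARPACK's internal state.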

@ogrisel
Member

ogrisel commented Jul 23, 2015

@yanlend can you please squash all your commits and rebase on top of current master?

@yanlend
Contributor Author

yanlend commented Jul 23, 2015

@ogrisel done!

@ogrisel
Member

ogrisel commented Jul 23, 2015

Weird, one of the windows tests has failed with:

======================================================================
FAIL: Non-regression test that shows null-space computation is better with
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27\lib\site-packages\sklearn\utils\tests\test_utils.py", line 152, in test_arpack_eigsh_initialization
    assert_greater_equal(w[0], 0)
  File "C:\Python27\lib\site-packages\sklearn\utils\testing.py", line 126, in assert_greater_equal
    assert a >= b, message
AssertionError: -127.75430418558309 is not greater than or equal to 0
----------------------------------------------------------------------

I wonder if it's random or not. Will try to relaunch this on appveyor. As the seed is fixed we should get the same failure.

BTW please fix the failing doctest in pipeline.rst as reported by travis. You just need to update the doctest to add a random_state=None in the text representation of that estimator.

@ogrisel
Member

ogrisel commented Jul 23, 2015

You can ignore the OneClassSVM segfault, which is unrelated. However, the test on non-negative eigenvalues is seemingly randomly failing under Windows with 32-bit Python. Maybe the ARPACK solver is just not stable enough on 32-bit Python, despite the deterministic seeding, for some reason I do not quite understand...

@@ -126,6 +130,28 @@ def test_pinvh_simple_complex():
assert_almost_equal(np.dot(a, a_pinv), np.eye(3))


def test_arpack_eigsh_initialization():
'''
Member

We prefer to start tests with comments, not docstrings, so we can see the name of the test that is failing.

@amueller
Member

Maybe we could add a random_state=None parameter to eigsh that does the uniform [-1, 1] random init with a numpy RNG, without using the ARPACK Fortran RNG, and we could contribute that to scipy. I am not proficient enough in Fortran to be sure that this is 100% equivalent to using the default Fortran-based seeding, though.

I'm not sure I understand the issue then. I thought scipy was using v0 = random_state.rand(M.shape[0]) by default, while ARPACK is using a uniform [-1, 1] init. That seems like a bug in scipy.

@yanlend
Contributor Author

yanlend commented Jul 31, 2015

If no initial v0 is passed to scipy, it does no initialization itself but lets ARPACK handle it. However, ARPACK does not use random_state, so the behavior is not deterministic.

The main issue is that in some places of sklearn, v0 was passed as an argument to eigsh drawn from the interval [0, 1] instead of [-1, 1]. This led to worse convergence in some cases.

@amueller
Member

ok, got it now. Thanks for the explanation.

@amueller
Member

amueller commented Sep 9, 2015

What is the status of this?

@ogrisel
Member

ogrisel commented Oct 5, 2015

@giorgiop, as @yanlend does not seem to have the time to finish this, it would be great if you could pick it up.

@ogrisel
Member

ogrisel commented Oct 5, 2015

In particular, this PR needs a rebase (and optionally a squash) on top of current master.

@yanlend
Contributor Author

yanlend commented Oct 19, 2015

@ogrisel I did the rebase.

rt.fit(X_train, y_train)
rt_lm.fit(SelectFromModel(rt, prefit=True).transform(X_train_lr), y_train_lr)
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
random_state=0)
Member

@yanlend : are these changes supposed to be in your PR? They look strange to me and somewhat unrelated.

@GaelVaroquaux
Member

I wonder whether there are merge artifacts in the current state of the PR.

@yanlend yanlend force-pushed the patch-2 branch 2 times, most recently from 540f37d to c6c9160 Compare October 19, 2015 13:03
@yanlend
Contributor Author

yanlend commented Oct 19, 2015

@GaelVaroquaux sorry about that, I somehow messed up the rebase. It should be fixed now.

@GaelVaroquaux GaelVaroquaux changed the title Initialize ARPACK eigsh [MRG+1] Initialize ARPACK eigsh Oct 19, 2015
@GaelVaroquaux
Member

Travis is not running on this guy. I don't understand why.

@GaelVaroquaux
Member

@yanlend: I am sorry, there are conflicts again. This bothers me, as I would like to have the tests run before merging.

Can you rebase once again? Sorry, this is happening because there is a sprint and therefore a lot of activity on the project.

@yanlend
Contributor Author

yanlend commented Oct 20, 2015

I don't think the failed PCA whitening tests are related to this PR. However, the regression test test_arpack_eigsh_initialization() also fails sometimes. The initialization proposed here is more stable than the previous one, but it does not guarantee that eigsh will converge. That's why I think it is not a good idea to include that test.

@GaelVaroquaux
Member

The failed PCA whitening tests don't fail on master and they seem legitimate to me.

For the other one, I am still thinking and trying to understand the right way to test this.

@giorgiop
Contributor

That test was actually not deterministic. I fixed it in #5141.

Can you please rebase on master and push again?

@GaelVaroquaux
Member

With regards to the failing test (not the whitening one): how about we add 1e-7 on the diagonal of the matrix? This should make the test slightly easier, and I would hope that it passes.
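The suggestion, sketched (a hedged example; the actual test matrix lives in test_utils.py):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(50, 50)
A = np.dot(A.T, A)                  # s.p.d. up to numerical precision
A[np.diag_indices_from(A)] += 1e-7  # push eigenvalues strictly above zero
```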

yanlend and others added 3 commits October 23, 2015 08:57
`v0 = random_state.rand(M.shape[0])` leads to an initial residual vector in ARPACK that is all positive. However, this is not the absolute or squared residual, but a true difference. Thus, it is better to initialize with `v0 = random_state.uniform(-1, 1, M.shape[0])` to get evenly distributed signs. This is the way ARPACK initializes the residuals.
The effect of the previous initialization is that eigsh frequently does not converge to the correct eigenvalues, e.g. negative eigenvalues for an s.p.d. matrix, which leads to an incorrect null-space.

- initialized all occurrences of sklearn.utils.arpack.eigsh the same way it would be initialized by ARPACK
- regression test to test behavior of new initialization
Comment instead of docstring to follow sklearn convention.
Regression test made easier by adding a small value to the main diagonal of the s.p.d. matrix
@yanlend
Contributor Author

yanlend commented Oct 23, 2015

I did the rebase and added the suggestion of @GaelVaroquaux.

@giorgiop
Contributor

  • Travis passed

@GaelVaroquaux
Member

OK, I am merging. Good job!

GaelVaroquaux added a commit that referenced this pull request Oct 23, 2015
[MRG+1] Initialize ARPACK eigsh
@GaelVaroquaux GaelVaroquaux merged commit 2492dd0 into scikit-learn:master Oct 23, 2015
@yanlend yanlend deleted the patch-2 branch October 23, 2015 07:34
@giorgiop
Contributor

Do you want to squash those commits @GaelVaroquaux ?
