[MRG] add lobpcg svd_solver to PCA and TruncatedSVD #12319

lobpcg · 2018-10-07T15:50:13Z

Reference Issues/PRs

fixes #12079, fixes #12080

What does this implement/fix? Explain your changes.

#12079 adds LOBPCG as an SVD solver in PCA
#12080 adds LOBPCG solver to Truncated PCA

lobpcg_svd should also be useful in KernelPCA for faster partial decompositions, see #12068

This PR also includes multiple LOBPCG related bug fixes, including vendoring sklearn/externals/_lobpcg.py from scipy 1.3.0

Any other comments?

@ogrisel Transferred from permanently closed PR #12291

Keep in mind for testing, that lobpcg_svd falls back to dense eigensolver unless n_components < 3*matrix_size, where matrix_size = min (n_samples, n_features)

Still to do, better in new focused PRs after this one is merged

example plot_faces_decomposition may include lobpcg_svd, just change

('Eigenfaces - PCA using randomized SVD',
decomposition.PCA(n_components=n_components, svd_solver='randomized',
whiten=True),
True),

to

('Eigenfaces - PCA using randomized SVD',
 decomposition.PCA(n_components=n_components, svd_solver='lobpcg',
                   whiten=True),
 True),

but lobpcg currently fails here for unclear numerical reasons. More testing may be needed for float32 data, like in this example.

All four existing TruncatedSVD examples of scikit-learn in the examples/ folder do run with lobpcg, just by adding the option ", algorithm='lobpcg' " to TruncatedSVD function call. But none generates the matrix large enough to demonstrate the practical benefits of lobpcg_svd.

sklearn-lgtm · 2018-10-07T22:07:27Z

This pull request introduces 5 alerts when merging 539dfd4 into 63e5ae6 - view on LGTM.com

new alerts:

1 for Non-callable called
1 for Unused import
1 for 'input' function used
1 for Module is imported with 'import' and 'import from'
1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-08T02:08:22Z

This pull request introduces 3 alerts when merging 2beaafc into 63e5ae6 - view on LGTM.com

new alerts:

1 for Non-callable called
1 for 'input' function used
1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-08T03:01:45Z

This pull request introduces 3 alerts when merging 5375a48 into 63e5ae6 - view on LGTM.com

new alerts:

1 for 'input' function used
1 for Module is imported with 'import' and 'import from'
1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-08T15:10:33Z

This pull request introduces 3 alerts when merging 17c9196 into 4e2e1fa - view on LGTM.com

new alerts:

1 for 'input' function used
1 for Module is imported with 'import' and 'import from'
1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-08T19:46:47Z

This pull request introduces 2 alerts when merging 3546217 into 5fd9e03 - view on LGTM.com

new alerts:

1 for Module is imported with 'import' and 'import from'
1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

rc · 2018-10-08T21:30:36Z

The development docs say: "The full scikit-learn tests can be run using ‘make’ in the root folder. Alternatively, running ‘pytest’ in a folder will run all the tests of the corresponding subpackages." This works for me:

pytest sklearn/decomposition/

after building the package in-place python setup.py build_ext -i.

sklearn-lgtm · 2018-10-09T02:02:13Z

This pull request introduces 2 alerts when merging d658176 into 5fd9e03 - view on LGTM.com

new alerts:

1 for Module is imported with 'import' and 'import from'
1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

rc · 2018-10-09T10:54:03Z

OK, I have your branch now. I am looking at the LGTM alerts.

rc · 2018-10-09T11:12:58Z

I am not sure how you want to proceed with working on this branch together, so here is my proposal:

I have just pushed a small fix to https://github.com/rc/scikit-learn/, branch lobpcg_svd.
Now you can add my repo to your .git/config and fetch from it (see below).
Then look at my changes and eventually merge them into your branch, and update this PR.
I will always fetch your branch before making anything to stay up-to-date.

You can add my repo by adding the following to your .git/config:

[remote "rc"]
	url = https://github.com/rc/scikit-learn.git
	fetch = +refs/heads/*:refs/remotes/rc/*

and then just use git fetch rc.

rc · 2018-10-09T19:16:28Z

OK, I accepted your invitation, and pushed my commit to this PR. Let's see :)

sklearn-lgtm · 2018-10-09T19:43:59Z

This pull request introduces 1 alert when merging 7db19f8 into 03c3af5 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

rc · 2018-10-09T20:04:42Z

OK, I will try debugging the errors (probably tomorrow).

Concerning the single LGTM alert Keyword argument 'matmat' is not a supported parameter name of LinearOperator.__init__, it is wrong: the LinearOperator from scipy.sparse.linalg does have the matmat argument. I wonder if this is caused by having __all__ = ['lobpcg'] in lobpcg.py?

rc · 2018-10-09T20:12:15Z

Concerning the single LGTM alert Keyword argument 'matmat' is not a supported parameter name of LinearOperator.__init__, it is wrong: the LinearOperator from scipy.sparse.linalg does have the matmat argument. I wonder if this is caused by having __all__ = ['lobpcg'] in lobpcg.py?

Sorry, I have no idea, even what __all__ = ['lobpcg'] in lobpcg.py possibly means.

This means that lobpcg is the only symbol visible, when the module is imported.
Anyway, we probably do not need to worry about the alert.

jnothman · 2018-10-10T09:34:09Z

Don't worry about the lgtm.com alerts

sklearn-lgtm · 2018-10-10T15:14:32Z

This pull request introduces 1 alert when merging d435213 into 00c2f41 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-10T19:45:26Z

This pull request introduces 1 alert when merging eea2083 into 00c2f41 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-10T20:51:30Z

This pull request introduces 1 alert when merging 3661ef5 into 39bd736 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-10T22:06:37Z

This pull request introduces 1 alert when merging 7f9ab5a into 39bd736 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-10T22:39:08Z

This pull request introduces 1 alert when merging 4c93c2f into 39bd736 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

sklearn-lgtm · 2018-10-11T01:08:58Z

This pull request introduces 1 alert when merging 11ef291 into 39bd736 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

rc · 2018-10-11T05:18:04Z

@lobpcg I have fetched your changes, thanks.

I do not really understand the sklearn.utils.lobpcg.as2d arg mismatch: ['ar'] kind errors - It is maybe that the function does not have Parameters section describing the input? I will try checking/fixing that.

sklearn-lgtm · 2018-10-11T12:28:26Z

This pull request introduces 1 alert when merging 2c2e5bb into 8985a63 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

rc · 2018-10-11T12:37:56Z

@lobpcg in what way? Seems OK for me. BTW. I have fixed the docstring errors.

lobpcg · 2018-10-11T14:26:25Z

I call git in two alternative ways, both GUI, locally and might have created a conflict locally.

BTW. I have fixed the docstring errors.

I of course noticed, the fixes are way above my head, great work!

All tests are passed now, except for LGTM, which are recommended to be ignored. There is nothing else trivial I can think of is left for us to do. I removed [WIP]. Should I put [MRG]?

lobpcg.py is well improved, thanks to your editing. Do you want to PR it to scipy? Add to the existing scipy PR?

BTW, in scipy/scipy#9275 I suggest adding lobpcg_svd to scipy, where it naturally belongs. It now can probably be just mostly copy/pasted from https://github.com/lobpcg/scikit-learn/blob/lobpcg_svd/sklearn/utils/extmath.py to scipy...

rc · 2018-10-12T07:26:02Z

I call git in two alternative ways, both GUI, locally and might have created a conflict locally.

The easiest way of fixing that is to fetch from github, and hard-reset your branch (master?) to the fetched one, in case you do not have local changes. If you have non-commited local changes, you can stash them first (git stash). Command-line guy talking :) Let me know if you need help.

lobpcg.py is well improved, thanks to your editing. Do you want to PR it to scipy? Add to the existing scipy PR?

It might be easier to create a new PR, after scipy/scipy#9352 is merged. Of course, just in case scipy maintainers agree to merge it. I will ping them to get some opinions.

BTW, in scipy/scipy#9275 I suggest adding lobpcg_svd to scipy, where it naturally belongs. It now can probably be just mostly copy/pasted from https://github.com/lobpcg/scikit-learn/blob/lobpcg_svd/sklearn/utils/extmath.py to scipy...

OK.

rc · 2018-10-12T14:40:22Z

I do not really understand the merge conflict - we just added a function, and the function logsumexp() below it was removed in the master - removing logsumexp() in this branch did not help resolving the problem.

lobpcg · 2019-10-25T23:27:17Z

To my vague recollection, the cost/iteration should be m^2n+9mn^2+n^3 in LOBPCG vs probably something like m^2n+mn^2+n^3 in randomized. When m/n is large, the difference 8mn^2 is well dominated by the common first term m^2n, so the cost per iteration is nearly the same, while LOBPCG should always converge faster, often MUCH faster, than randomized, which may become visible even after only a few iterations. The speed improvement depends on the distribution of the singular values, not directly on m/n, that is why it should be inconsistent between different datasets.

So the reported tests results for "large" m/n are as expected.

Another consideration: "Direct solve" eigh uses n=m and takes m^3 (no iterations) in LAPACK. When m/n is small, like 10, and you need to iterate at least 10 times, it's much better just to run plain LAPACK eigh, so testing this PR on m/n<10 makes little practical sense. In fact, if m/n<5 LOBPCG algorithm does not even run - the lobpcg code simply switches inside to the standard LAPACK solver, moreover, does it very inefficiently, see scipy/scipy#10983, (the user is not expected to run lobpcg for over 20% of eigenvalues, so nobody paid attention to this before).

The bottom line is that LOBPCG cannot lose to randomized, unless m/n is small, but then they both should lose to eigh

lobpcg · 2019-11-10T19:12:15Z

scipy/scipy#9275 is now merged, adding lobpcg_svd to scipy

lobpcg · 2019-11-29T16:29:58Z

@glemaitre or anybody else:
I am trying to complete this effort somehow, as we have invested lots of time into it. But I am blocked.

I have just updated https://github.com/lobpcg/scikit-learn/tree/master from https://github.com/scikit-learn/scikit-learn

I am now trying to update this branch https://github.com/lobpcg/scikit-learn/tree/lobpcg_svd from https://github.com/lobpcg/scikit-learn/tree/master but cannot due to

Conflicting files
sklearn/decomposition/pca.py
sklearn/decomposition/truncated_svd.py
sklearn/manifold/spectral_embedding.py

which I do not know how to resolve. Could you please help to resolve the Conflicting files so that I can bring this branch up to date with https://github.com/lobpcg/scikit-learn/tree/master . I just use GitHub Desktop, not really familiar with git commands.

lobpcg · 2019-12-01T03:56:11Z

@thomasjpfan -thanks very much for your help!

There are some docstring errors/warnings that I still cannot figure out how to fix,

Sphinx Warnings in affected files

    sklearn/decomposition/_pca.py:docstring of sklearn.decomposition.PCA:79: WARNING: Inline emphasis start-string without end-string.
    sklearn/decomposition/_pca.py:docstring of sklearn.decomposition.PCA:88: WARNING: Unexpected indentation.
    sklearn/decomposition/_pca.py:docstring of sklearn.decomposition.PCA:189: WARNING: Unexpected indentation.
    sklearn/decomposition/_truncated_svd.py:docstring of sklearn.decomposition.TruncatedSVD:32: WARNING: Unexpected indentation.
    sklearn/decomposition/_truncated_svd.py:docstring of sklearn.decomposition.TruncatedSVD:33: WARNING: Block quote ends without a blank line; unexpected unindent.
    sklearn/decomposition/_truncated_svd.py:docstring of sklearn.decomposition.TruncatedSVD:34: WARNING: Inline emphasis start-string without end-string.
    sklearn/decomposition/_truncated_svd.py:docstring of sklearn.decomposition.TruncatedSVD:58: WARNING: Unexpected indentation.

But I guess they can be just ignored for now.

dkobak · 2020-01-06T14:03:48Z

Just chiming in to support this PR!

dkobak · 2020-01-06T14:06:59Z

@lobpcg If merged, this would work correctly for a sparse input into PCA (i.e. implicitly center the input), right?

lobpcg · 2020-01-06T15:08:36Z

@lobpcg If merged, this would work correctly for a sparse input into PCA (i.e. implicitly center the input), right?

@dkobak Thanks for your support!

Yes, this PR surely also works correctly for a sparse input into PCA, adding a new solver 'lobpcg' within scikit. However, it is an independent separate issue from and does not supersede the implicit centering the input like in #12794.

zachmayer · 2020-07-15T19:58:55Z

@lobpcg this pr seems stalled. Do you need any help getting it merged? I'd be happy to contribute commits

lobpcg · 2020-07-16T21:25:46Z

@zachmayer yes, thanks for your willingness to help! The known work remained is as follows:

bring the repo up to date, resolving all the conflicts accumulated since the project has stalled
modify the tests developed by @glemaitre to remove the datasets with irrelevant sizes to this PR and add relevant cases, where the code actually runs in a non-trivial way
add another comparison with, may be use of, the similar new feature: adding LOBPCG solver in svds in addition to ARPACK scipy/scipy#9275 change made while this project has been on hold.
update documentation accordingly with the info from (2) and (3).
convince the admins to actually merge.

Still interested to help getting it merged?

zachmayer · 2020-07-17T14:23:03Z

@lobpcg I'll take a look. Could you elaborate on step 3?

I may be able to at least help resolve the merge conflicts

lobpcg · 2020-07-17T14:39:14Z

@lobpcg I'll take a look. Could you elaborate on step 3?

scipy/scipy#10830 has implemented LOBPCG solver in svds in scipy while this PR being stalled. Which gives yet another option to add lobpcg svd_solver to PCA and TruncatedSVD simply by calling https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html with solver=‘lobpcg’

zachmayer · 2020-07-17T15:05:36Z

Ahh, yeah calling the scipy solver with solver=‘lobpcg’ may simplify this a lot

lobpcg · 2020-07-17T16:22:19Z

Ahh, yeah calling the scipy solver with solver=‘lobpcg’ may simplify this a lot

Yes, but one needs to consider that:

the current PR has everything already coded and working, while replacing some large parts of the code with the scipy solver with solver=‘lobpcg’ is quite a bit of changes.
the scipy solver with solver=‘lobpcg’ setup should be similar to that used in this PR, but they are algorithmically different each having its own (dis-)advantages. So better both are being tested, before deciding on keeping one of them or even both.
hard-coding scipy solver with solver=‘lobpcg’ as the only option in sklearn adds a dependency on scipy 1.4.0 that may be too recent for sklearn.

ogrisel · 2020-11-27T17:35:46Z

hard-coding scipy solver with solver=‘lobpcg’ as the only option in sklearn adds a dependency on scipy 1.4.0 that may be too recent for sklearn.

I think it's fine for a new feature to be only available for newer versions of the dependency as long as we raise a ValueError with a clear and informative error message if this non-default option is selected by the user and its scipy version is too old.

lobpcg · 2021-10-30T19:45:36Z

scipy/scipy#10830 has implemented LOBPCG solver in svds in scipy while this PR being stalled. Which allows using lobpcg svd_solver to PCA and TruncatedSVD simply by calling https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html with solver=‘lobpcg’

dkobak · 2021-10-30T20:46:03Z

I think I am a bit confused here @lobpcg: the scipy PR was merged two years ago, so why are you only closing this PR now, two years later?

In any case, thanks a lot for your efforts!

lobpcg · 2021-10-30T22:21:18Z

I thought about coming back to it, factoring in the SciPy implementation, but never found time and then forgot about it.
My other pr from that time was simpler, so I rejuvenated it recently for some external reasons and it may be now merged. Working on it reminded me about this one and helped me to realize to better just close it.

glemaitre · 2021-11-02T09:42:04Z

The comment of @ogrisel gives the way forward since this is merged in SciPy: #12319 (comment)

Basically, we have two possibilities: either backport the method in sklearn.utils.fixes or raise an error mentioning to upgrade SciPy to a newer version to have the solver available. The second solution is the one used in QuantileRegressor for some solvers. It seems that it is the one that would require the least amount of code and maintenance on our side.

So much time wasted, unbelievable!

I am pretty upset when reading these types of comments that I find really indecent.

lobpcg · 2021-11-02T11:32:31Z

I surely did not mean to upset anyone, so I apologize.

This was referenced Oct 7, 2018

BUG: Fix lobpcg() largest option scipy/scipy#9352

Merged

new feature: add LOBPCG as an SVD solver in PCA #12079

Closed

lobpcg changed the title ~~[WIP] added lobpcg_svd solver to pca and truncated_svd~~ added 'lobpcg_svd' solver to pca and 'truncated_svd with all extra tests as for 'randomized' solver Oct 11, 2018

lobpcg changed the title ~~added 'lobpcg_svd' solver to pca and 'truncated_svd with all extra tests as for 'randomized' solver~~ added 'lobpcg_svd' solver to pca and truncated_svd with all extra tests as for 'randomized' solver Oct 11, 2018

thomasjpfan and others added 4 commits November 29, 2019 12:29

Merge remote-tracking branch 'upstream/master' into pr/12319

4da2e77

remove duplicated References section

bd1755e

Notes fixed

1ef74b1

fixing the docstring format

c3b612b

cmarmo added module:decomposition Needs Decision Requires decision labels Oct 22, 2020

Base automatically changed from master to main January 22, 2021 10:50

lobpcg closed this Oct 30, 2021

lobpcg deleted the lobpcg_svd branch October 30, 2021 19:45

lobpcg mentioned this pull request Sep 15, 2022

How do arpack, propack and lobcg compare to each other? scipy/scipy#17023

Open

[MRG] add lobpcg svd_solver to PCA and TruncatedSVD #12319

[MRG] add lobpcg svd_solver to PCA and TruncatedSVD #12319

Conversation

lobpcg commented Oct 7, 2018 • edited Loading

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

sklearn-lgtm commented Oct 7, 2018

sklearn-lgtm commented Oct 8, 2018

sklearn-lgtm commented Oct 8, 2018

sklearn-lgtm commented Oct 8, 2018

sklearn-lgtm commented Oct 8, 2018

rc commented Oct 8, 2018 • edited Loading

sklearn-lgtm commented Oct 9, 2018

rc commented Oct 9, 2018

rc commented Oct 9, 2018

rc commented Oct 9, 2018

sklearn-lgtm commented Oct 9, 2018

rc commented Oct 9, 2018

rc commented Oct 9, 2018

jnothman commented Oct 10, 2018 via email

sklearn-lgtm commented Oct 10, 2018

sklearn-lgtm commented Oct 10, 2018

sklearn-lgtm commented Oct 10, 2018

sklearn-lgtm commented Oct 10, 2018

sklearn-lgtm commented Oct 10, 2018

sklearn-lgtm commented Oct 11, 2018

rc commented Oct 11, 2018

sklearn-lgtm commented Oct 11, 2018

rc commented Oct 11, 2018

lobpcg commented Oct 11, 2018 • edited Loading

rc commented Oct 12, 2018

rc commented Oct 12, 2018

lobpcg commented Oct 25, 2019 • edited Loading

lobpcg commented Nov 10, 2019

lobpcg commented Nov 29, 2019

lobpcg commented Dec 1, 2019 • edited Loading

dkobak commented Jan 6, 2020

dkobak commented Jan 6, 2020

lobpcg commented Jan 6, 2020

zachmayer commented Jul 15, 2020

lobpcg commented Jul 16, 2020

zachmayer commented Jul 17, 2020

lobpcg commented Jul 17, 2020

zachmayer commented Jul 17, 2020

lobpcg commented Jul 17, 2020

ogrisel commented Nov 27, 2020

lobpcg commented Oct 30, 2021 • edited Loading

dkobak commented Oct 30, 2021

lobpcg commented Oct 30, 2021

glemaitre commented Nov 2, 2021

lobpcg commented Nov 2, 2021

lobpcg commented Oct 7, 2018 •

edited

Loading

rc commented Oct 8, 2018 •

edited

Loading

lobpcg commented Oct 11, 2018 •

edited

Loading

lobpcg commented Oct 25, 2019 •

edited

Loading

lobpcg commented Dec 1, 2019 •

edited

Loading

lobpcg commented Oct 30, 2021 •

edited

Loading