Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add lobpcg svd_solver to PCA and TruncatedSVD #12319

Closed
wants to merge 119 commits into from
Closed

[MRG] add lobpcg svd_solver to PCA and TruncatedSVD #12319

wants to merge 119 commits into from

Conversation

lobpcg
Copy link
Contributor

@lobpcg lobpcg commented Oct 7, 2018

Reference Issues/PRs

fixes #12079, fixes #12080

What does this implement/fix? Explain your changes.

#12079 adds LOBPCG as an SVD solver in PCA
#12080 adds LOBPCG solver to Truncated PCA

lobpcg_svd should also be useful in KernelPCA for faster partial decompositions, see #12068

This PR also includes multiple LOBPCG related bug fixes, including vendoring sklearn/externals/_lobpcg.py from scipy 1.3.0

Any other comments?

@ogrisel Transferred from permanently closed PR #12291

Keep in mind for testing, that lobpcg_svd falls back to dense eigensolver unless n_components < 3*matrix_size, where matrix_size = min (n_samples, n_features)

Still to do, better in new focused PRs after this one is merged

  1. example plot_faces_decomposition may include lobpcg_svd, just change

    ('Eigenfaces - PCA using randomized SVD',
    decomposition.PCA(n_components=n_components, svd_solver='randomized',
    whiten=True),
    True),

to

('Eigenfaces - PCA using randomized SVD',
 decomposition.PCA(n_components=n_components, svd_solver='lobpcg',
                   whiten=True),
 True),

but lobpcg currently fails here for unclear numerical reasons. More testing may be needed for float32 data, like in this example.

  1. All four existing TruncatedSVD examples of scikit-learn in the examples/ folder do run with lobpcg, just by adding the option ", algorithm='lobpcg' " to TruncatedSVD function call. But none generates the matrix large enough to demonstrate the practical benefits of lobpcg_svd.

@sklearn-lgtm
Copy link

This pull request introduces 5 alerts when merging 539dfd4 into 63e5ae6 - view on LGTM.com

new alerts:

  • 1 for Non-callable called
  • 1 for Unused import
  • 1 for 'input' function used
  • 1 for Module is imported with 'import' and 'import from'
  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 3 alerts when merging 2beaafc into 63e5ae6 - view on LGTM.com

new alerts:

  • 1 for Non-callable called
  • 1 for 'input' function used
  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 3 alerts when merging 5375a48 into 63e5ae6 - view on LGTM.com

new alerts:

  • 1 for 'input' function used
  • 1 for Module is imported with 'import' and 'import from'
  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 3 alerts when merging 17c9196 into 4e2e1fa - view on LGTM.com

new alerts:

  • 1 for 'input' function used
  • 1 for Module is imported with 'import' and 'import from'
  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 2 alerts when merging 3546217 into 5fd9e03 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'
  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@rc
Copy link

rc commented Oct 8, 2018

The development docs say: "The full scikit-learn tests can be run using ‘make’ in the root folder. Alternatively, running ‘pytest’ in a folder will run all the tests of the corresponding subpackages." This works for me:

pytest sklearn/decomposition/

after building the package in-place python setup.py build_ext -i.

@sklearn-lgtm
Copy link

This pull request introduces 2 alerts when merging d658176 into 5fd9e03 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'
  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@rc
Copy link

rc commented Oct 9, 2018

OK, I have your branch now. I am looking at the LGTM alerts.

@rc
Copy link

rc commented Oct 9, 2018

I am not sure how you want to proceed with working on this branch together, so here is my proposal:

  • I have just pushed a small fix to https://github.com/rc/scikit-learn/, branch lobpcg_svd.
  • Now you can add my repo to your .git/config and fetch from it (see below).
  • Then look at my changes and eventually merge them into your branch, and update this PR.
  • I will always fetch your branch before making anything to stay up-to-date.

You can add my repo by adding the following to your .git/config:

[remote "rc"]
	url = https://github.com/rc/scikit-learn.git
	fetch = +refs/heads/*:refs/remotes/rc/*

and then just use git fetch rc.

@rc
Copy link

rc commented Oct 9, 2018

OK, I accepted your invitation, and pushed my commit to this PR. Let's see :)

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 7db19f8 into 03c3af5 - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@rc
Copy link

rc commented Oct 9, 2018

OK, I will try debugging the errors (probably tomorrow).

Concerning the single LGTM alert Keyword argument 'matmat' is not a supported parameter name of LinearOperator.__init__, it is wrong: the LinearOperator from scipy.sparse.linalg does have the matmat argument. I wonder if this is caused by having __all__ = ['lobpcg'] in lobpcg.py?

@rc
Copy link

rc commented Oct 9, 2018

Concerning the single LGTM alert Keyword argument 'matmat' is not a supported parameter name of LinearOperator.__init__, it is wrong: the LinearOperator from scipy.sparse.linalg does have the matmat argument. I wonder if this is caused by having __all__ = ['lobpcg'] in lobpcg.py?

Sorry, I have no idea, even what __all__ = ['lobpcg'] in lobpcg.py possibly means.

This means that lobpcg is the only symbol visible, when the module is imported.
Anyway, we probably do not need to worry about the alert.

@jnothman
Copy link
Member

jnothman commented Oct 10, 2018 via email

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging d435213 into 00c2f41 - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging eea2083 into 00c2f41 - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 3661ef5 into 39bd736 - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 7f9ab5a into 39bd736 - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 4c93c2f into 39bd736 - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 11ef291 into 39bd736 - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@rc
Copy link

rc commented Oct 11, 2018

@lobpcg I have fetched your changes, thanks.

I do not really understand the sklearn.utils.lobpcg.as2d arg mismatch: ['ar'] kind errors - It is maybe that the function does not have Parameters section describing the input? I will try checking/fixing that.

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 2c2e5bb into 8985a63 - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

Comment posted by LGTM.com

@rc
Copy link

rc commented Oct 11, 2018

@lobpcg in what way? Seems OK for me. BTW. I have fixed the docstring errors.

@lobpcg lobpcg changed the title [WIP] added lobpcg_svd solver to pca and truncated_svd added 'lobpcg_svd' solver to pca and 'truncated_svd with all extra tests as for 'randomized' solver Oct 11, 2018
@lobpcg
Copy link
Contributor Author

lobpcg commented Oct 11, 2018

I call git in two alternative ways, both GUI, locally and might have created a conflict locally.

BTW. I have fixed the docstring errors.

I of course noticed, the fixes are way above my head, great work!

All tests are passed now, except for LGTM, which are recommended to be ignored. There is nothing else trivial I can think of is left for us to do. I removed [WIP]. Should I put [MRG]?

lobpcg.py is well improved, thanks to your editing. Do you want to PR it to scipy? Add to the existing scipy PR?

BTW, in scipy/scipy#9275 I suggest adding lobpcg_svd to scipy, where it naturally belongs. It now can probably be just mostly copy/pasted from https://github.com/lobpcg/scikit-learn/blob/lobpcg_svd/sklearn/utils/extmath.py to scipy...

@lobpcg lobpcg changed the title added 'lobpcg_svd' solver to pca and 'truncated_svd with all extra tests as for 'randomized' solver added 'lobpcg_svd' solver to pca and truncated_svd with all extra tests as for 'randomized' solver Oct 11, 2018
@rc
Copy link

rc commented Oct 12, 2018

I call git in two alternative ways, both GUI, locally and might have created a conflict locally.

The easiest way of fixing that is to fetch from github, and hard-reset your branch (master?) to the fetched one, in case you do not have local changes. If you have non-commited local changes, you can stash them first (git stash). Command-line guy talking :) Let me know if you need help.

lobpcg.py is well improved, thanks to your editing. Do you want to PR it to scipy? Add to the existing scipy PR?

It might be easier to create a new PR, after scipy/scipy#9352 is merged. Of course, just in case scipy maintainers agree to merge it. I will ping them to get some opinions.

BTW, in scipy/scipy#9275 I suggest adding lobpcg_svd to scipy, where it naturally belongs. It now can probably be just mostly copy/pasted from https://github.com/lobpcg/scikit-learn/blob/lobpcg_svd/sklearn/utils/extmath.py to scipy...

OK.

@rc
Copy link

rc commented Oct 12, 2018

I do not really understand the merge conflict - we just added a function, and the function logsumexp() below it was removed in the master - removing logsumexp() in this branch did not help resolving the problem.

@lobpcg
Copy link
Contributor Author

lobpcg commented Oct 25, 2019

To my vague recollection, the cost/iteration should be m^2n+9mn^2+n^3 in LOBPCG vs probably something like m^2n+mn^2+n^3 in randomized. When m/n is large, the difference 8mn^2 is well dominated by the common first term m^2n, so the cost per iteration is nearly the same, while LOBPCG should always converge faster, often MUCH faster, than randomized, which may become visible even after only a few iterations. The speed improvement depends on the distribution of the singular values, not directly on m/n, that is why it should be inconsistent between different datasets.

So the reported tests results for "large" m/n are as expected.

Another consideration: "Direct solve" eigh uses n=m and takes m^3 (no iterations) in LAPACK. When m/n is small, like 10, and you need to iterate at least 10 times, it's much better just to run plain LAPACK eigh, so testing this PR on m/n<10 makes little practical sense. In fact, if m/n<5 LOBPCG algorithm does not even run - the lobpcg code simply switches inside to the standard LAPACK solver, moreover, does it very inefficiently, see scipy/scipy#10983, (the user is not expected to run lobpcg for over 20% of eigenvalues, so nobody paid attention to this before).

The bottom line is that LOBPCG cannot lose to randomized, unless m/n is small, but then they both should lose to eigh

@lobpcg
Copy link
Contributor Author

lobpcg commented Nov 10, 2019

scipy/scipy#9275 is now merged, adding lobpcg_svd to scipy

@lobpcg
Copy link
Contributor Author

lobpcg commented Nov 29, 2019

@glemaitre or anybody else:
I am trying to complete this effort somehow, as we have invested lots of time into it. But I am blocked.

I have just updated https://github.com/lobpcg/scikit-learn/tree/master from https://github.com/scikit-learn/scikit-learn

I am now trying to update this branch https://github.com/lobpcg/scikit-learn/tree/lobpcg_svd from https://github.com/lobpcg/scikit-learn/tree/master but cannot due to

Conflicting files
sklearn/decomposition/pca.py
sklearn/decomposition/truncated_svd.py
sklearn/manifold/spectral_embedding.py

which I do not know how to resolve. Could you please help to resolve the Conflicting files so that I can bring this branch up to date with https://github.com/lobpcg/scikit-learn/tree/master . I just use GitHub Desktop, not really familiar with git commands.

@lobpcg
Copy link
Contributor Author

lobpcg commented Dec 1, 2019

@thomasjpfan -thanks very much for your help!

There are some docstring errors/warnings that I still cannot figure out how to fix,

Sphinx Warnings in affected files

    sklearn/decomposition/_pca.py:docstring of sklearn.decomposition.PCA:79: WARNING: Inline emphasis start-string without end-string.
    sklearn/decomposition/_pca.py:docstring of sklearn.decomposition.PCA:88: WARNING: Unexpected indentation.
    sklearn/decomposition/_pca.py:docstring of sklearn.decomposition.PCA:189: WARNING: Unexpected indentation.
    sklearn/decomposition/_truncated_svd.py:docstring of sklearn.decomposition.TruncatedSVD:32: WARNING: Unexpected indentation.
    sklearn/decomposition/_truncated_svd.py:docstring of sklearn.decomposition.TruncatedSVD:33: WARNING: Block quote ends without a blank line; unexpected unindent.
    sklearn/decomposition/_truncated_svd.py:docstring of sklearn.decomposition.TruncatedSVD:34: WARNING: Inline emphasis start-string without end-string.
    sklearn/decomposition/_truncated_svd.py:docstring of sklearn.decomposition.TruncatedSVD:58: WARNING: Unexpected indentation. 

But I guess they can be just ignored for now.

@dkobak
Copy link
Contributor

dkobak commented Jan 6, 2020

Just chiming in to support this PR!

@dkobak
Copy link
Contributor

dkobak commented Jan 6, 2020

@lobpcg If merged, this would work correctly for a sparse input into PCA (i.e. implicitly center the input), right?

@lobpcg
Copy link
Contributor Author

lobpcg commented Jan 6, 2020

@lobpcg If merged, this would work correctly for a sparse input into PCA (i.e. implicitly center the input), right?

@dkobak Thanks for your support!

Yes, this PR surely also works correctly for a sparse input into PCA, adding a new solver 'lobpcg' within scikit. However, it is an independent separate issue from and does not supersede the implicit centering the input like in #12794.

@zachmayer
Copy link
Contributor

@lobpcg this pr seems stalled. Do you need any help getting it merged? I'd be happy to contribute commits

@lobpcg
Copy link
Contributor Author

lobpcg commented Jul 16, 2020

@zachmayer yes, thanks for your willingness to help! The known work remained is as follows:

  1. bring the repo up to date, resolving all the conflicts accumulated since the project has stalled
  2. modify the tests developed by @glemaitre to remove the datasets with irrelevant sizes to this PR and add relevant cases, where the code actually runs in a non-trivial way
  3. add another comparison with, may be use of, the similar new feature: adding LOBPCG solver in svds in addition to ARPACK scipy/scipy#9275 change made while this project has been on hold.
  4. update documentation accordingly with the info from (2) and (3).
  5. convince the admins to actually merge.

Still interested to help getting it merged?

@zachmayer
Copy link
Contributor

@lobpcg I'll take a look. Could you elaborate on step 3?

I may be able to at least help resolve the merge conflicts

@lobpcg
Copy link
Contributor Author

lobpcg commented Jul 17, 2020

@lobpcg I'll take a look. Could you elaborate on step 3?

scipy/scipy#10830 has implemented LOBPCG solver in svds in scipy while this PR being stalled. Which gives yet another option to add lobpcg svd_solver to PCA and TruncatedSVD simply by calling https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html with solver=‘lobpcg’

@zachmayer
Copy link
Contributor

Ahh, yeah calling the scipy solver with solver=‘lobpcg’ may simplify this a lot

@lobpcg
Copy link
Contributor Author

lobpcg commented Jul 17, 2020

Ahh, yeah calling the scipy solver with solver=‘lobpcg’ may simplify this a lot

Yes, but one needs to consider that:

  • the current PR has everything already coded and working, while replacing some large parts of the code with the scipy solver with solver=‘lobpcg’ is quite a bit of changes.
  • the scipy solver with solver=‘lobpcg’ setup should be similar to that used in this PR, but they are algorithmically different each having its own (dis-)advantages. So better both are being tested, before deciding on keeping one of them or even both.
  • hard-coding scipy solver with solver=‘lobpcg’ as the only option in sklearn adds a dependency on scipy 1.4.0 that may be too recent for sklearn.

@ogrisel
Copy link
Member

ogrisel commented Nov 27, 2020

hard-coding scipy solver with solver=‘lobpcg’ as the only option in sklearn adds a dependency on scipy 1.4.0 that may be too recent for sklearn.

I think it's fine for a new feature to be only available for newer versions of the dependency as long as we raise a ValueError with a clear and informative error message if this non-default option is selected by the user and its scipy version is too old.

Base automatically changed from master to main January 22, 2021 10:50
@lobpcg
Copy link
Contributor Author

lobpcg commented Oct 30, 2021

scipy/scipy#10830 has implemented LOBPCG solver in svds in scipy while this PR being stalled. Which allows using lobpcg svd_solver to PCA and TruncatedSVD simply by calling https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html with solver=‘lobpcg’

@lobpcg lobpcg closed this Oct 30, 2021
@lobpcg lobpcg deleted the lobpcg_svd branch October 30, 2021 19:45
@dkobak
Copy link
Contributor

dkobak commented Oct 30, 2021

I think I am a bit confused here @lobpcg: the scipy PR was merged two years ago, so why are you only closing this PR now, two years later?

In any case, thanks a lot for your efforts!

@lobpcg
Copy link
Contributor Author

lobpcg commented Oct 30, 2021

I thought about coming back to it, factoring in the SciPy implementation, but never found time and then forgot about it.
My other pr from that time was simpler, so I rejuvenated it recently for some external reasons and it may be now merged. Working on it reminded me about this one and helped me to realize to better just close it.

@glemaitre
Copy link
Member

The comment of @ogrisel gives the way forward since this is merged in SciPy: #12319 (comment)

Basically, we have two possibilities: either backport the method in sklearn.utils.fixes or raise an error mentioning to upgrade SciPy to a newer version to have the solver available. The second solution is the one used in QuantileRegressor for some solvers. It seems that it is the one that would require the least amount of code and maintenance on our side.

So much time wasted, unbelievable!

I am pretty upset when reading these types of comments that I find really indecent.

@lobpcg
Copy link
Contributor Author

lobpcg commented Nov 2, 2021

I surely did not mean to upset anyone, so I apologize.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

new feature: add LOBPCG solver to Truncated PCA new feature: add LOBPCG as an SVD solver in PCA