[MRG + 2] ENH: optimizing power iterations phase for randomized_svd #5141

Merged
merged 1 commit into scikit-learn:master from giorgiop:power-iter-randomized-svd on Oct 21, 2015

Conversation

@giorgiop
Contributor

giorgiop commented Aug 19, 2015

I am writing some benchmarks to see if we can set a default number of power iterations for `utils.extmath.randomized_svd`. The function implements the work of Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions, Halko et al., 2009, http://arxiv.org/abs/arXiv:0909.4061

See the code for more details.

Still to do:

  • review unit tests for randomized_svd
  • benchmark whether we can find an absolute number that works regardless of the input matrix rank, e.g. is 1 always better than 0? In other words, can we find a small integer such that we always improve approximation quality while not making the algorithm significantly slower?
  • extract at least 5 components in every experiment
  • extend randomized_svd with normalized power iterations with QR at each step, and another version with LU
  • sample the random matrix from U[-1,1] instead of from a Gaussian
  • docs

And finally

  • set a default value for n_iter in randomized_svd and a default strategy for normalizing matrices between power iterations
  • review all use of randomized_svd (in TruncatedSVD and RandomizedPCA)
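For readers following along, the two-stage scheme of Halko et al. that the to-do list refers to can be sketched in a few lines. This is a simplified illustration with assumed names and defaults, not the scikit-learn implementation:

```python
import numpy as np

def randomized_svd_sketch(A, n_components, n_oversamples=10, n_iter=2,
                          random_state=0):
    """Minimal sketch of the randomized SVD of Halko et al. (2009).

    Hypothetical simplified version for illustration; scikit-learn's
    version lives in sklearn.utils.extmath.randomized_svd.
    """
    rng = np.random.RandomState(random_state)
    n_random = n_components + n_oversamples

    # Stage A: find an orthonormal basis Q approximating the range of A.
    Omega = rng.normal(size=(A.shape[1], n_random))
    Y = A @ Omega
    # Plain (unnormalized) power iterations sharpen the spectrum of A,
    # improving the basis when singular values decay slowly.
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)

    # Stage B: exact SVD of the small projected matrix B = Q^T A.
    B = Q.T @ A
    Uhat, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Uhat
    return U[:, :n_components], s[:n_components], Vt[:n_components]
```

The open questions in the list above (how many power iterations, whether to normalize between them, how to draw the test matrix) all concern the Stage A loop.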
@giorgiop

Contributor

giorgiop commented Aug 19, 2015

  • singular values distance = squared L2 norm of the difference between the singular values from standard SVD and the approximated ones from `randomized_svd`
  • Frobenius distance = Frobenius norm of the difference between the original input matrix and the product USV output by `randomized_svd`
  • noise component = parameter in [0, 1] passed as input to `make_low_rank_matrix`
  • the size of the dots in the last plots is proportional to the number of power iterations used for that run --growing to the right
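As a sketch of how these two metrics can be computed (the helper names here are hypothetical; the actual benchmark script is in the PR):

```python
import numpy as np

def singular_values_distance(A, s_approx):
    """Squared L2 norm between the exact singular values of A and the
    approximated ones (e.g. returned by randomized_svd)."""
    s_exact = np.linalg.svd(A, compute_uv=False)[:len(s_approx)]
    return np.sum((s_exact - s_approx) ** 2)

def frobenius_distance(A, U, s, Vt):
    """Frobenius norm of the difference between A and the product USV."""
    return np.linalg.norm(A - (U * s) @ Vt, ord='fro')
```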

[figures 1-10: benchmark plots]

So far, the plots suggest 2 may be the right choice. A larger number may even degrade the quality of the approximation. The flat behaviour in the last two plots comes from very small datasets.

Ping @ogrisel

@ogrisel

Member

ogrisel commented Aug 20, 2015

This looks great although the plots on diabetes and abalone are not very interesting as they extract a single component. Maybe you should force to extract at least 5 components to make them more relevant.

plt.legend(loc="lower right")
plt.axhline(y=0, ls=':', c='black')
plt.suptitle("%s: distances from SVD vs. running time\n \
fraction of compontens used in rand_SVD = %i" %

@lesteve

lesteve Aug 20, 2015

Member

Typo: components.

Also from your plots, it looks like this is the number of components and not the fraction right?

@giorgiop

giorgiop Aug 20, 2015

Contributor

Yes, I always take 5%. That's the absolute number. Thanks.

plt.suptitle("Singular values distance vs. noise\n \
matrix shape = %i x %i, rank = %i, n comps in rand_SVD = %i" %
(data.shape[0], data.shape[1], rank, n_comps))
plt.ylabel("Singual values l2 distance")

@lesteve

lesteve Aug 20, 2015

Member

typo: Singual -> Singular

@giorgiop

Contributor

giorgiop commented Aug 21, 2015

scipy.sparse.linalg.svds does reduce running time a little, but the singular values and singular vectors it returns are sorted differently from `randomized_svd` and `scipy.linalg.svd`. I would stick with the original; there is not much to gain here anyway.

I have coded one last test. Looking at the 2 new plots, I think we can be confident that 2 is a good default value. The input shape was (500, 5000). `rank` is the rank of the input matrix; `n_comps` is the parameter of `randomized_svd`.

[figures 19-20: benchmark plots]

@giorgiop

Contributor

giorgiop commented Aug 21, 2015

For completeness, here are the other plots, updated. (There is no change on the small datasets, even when computing 5 components.)

[figures 1-10: updated benchmark plots]

@ogrisel

Member

ogrisel commented Aug 24, 2015

This is great. Thanks for this empirical study. +1 for using 2 power iterations in randomized_svd by default.

@ogrisel

Member

ogrisel commented Aug 24, 2015

I wonder if we should change the default value for RandomizedPCA and TruncatedSVD from 3 to 2 as well:

  • on the plus side: this would give a non-negligible speed up with similar accuracy on most datasets,
  • on the negative side: it can cause a slight change of behavior that users upgrading from a previous version of scikit-learn might not expect.

Any opinion by others, e.g. @larsmans?

@amueller

Member

amueller commented Aug 24, 2015

How do we compare against fbpca? https://github.com/facebook/fbpca ? That also just uses power iterations, right?

@mblondel

Member

mblondel commented Aug 25, 2015

Another thing which would be nice to investigate if you have time: our implementation of range_finder uses Algorithm 4.3 but according to the paper the procedure is numerically unstable. The paper suggests a subspace iteration method in Algorithm 4.4. It would be interesting to compare both algorithms in terms of accuracy and computational time.
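For concreteness, the subspace iteration of Algorithm 4.4 differs from the plain power iteration in that it re-orthonormalizes the basis after every application of A or A^T, which keeps the columns from collapsing onto the dominant singular directions. A minimal sketch, assuming a Gaussian test matrix (function name is hypothetical):

```python
import numpy as np

def subspace_iteration_range_finder(A, size, n_iter, random_state=0):
    """Sketch of Halko et al. Algorithm 4.4: a range finder whose power
    iterations are stabilized by a QR factorization at every step."""
    rng = np.random.RandomState(random_state)
    Q = rng.normal(size=(A.shape[1], size))
    Q, _ = np.linalg.qr(A @ Q)
    for _ in range(n_iter):
        # Re-orthonormalize after each multiplication to avoid the
        # floating-point information loss of the unnormalized scheme.
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    return Q  # orthonormal basis for the approximate range of A
```

Each extra QR adds cost per iteration, which is exactly the accuracy/time trade-off worth benchmarking against Algorithm 4.3.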

@giorgiop

Contributor

giorgiop commented Aug 31, 2015

How do we compare against fbpca? https://github.com/facebook/fbpca ? That also just uses power iterations, right?

It does. And it also uses 2 power iterations by default.

I ran a few tests. Sklearn seems slower. Performance is similar, but sklearn is slightly better here. A difference is that fbpca's performance does not deteriorate when the number of power iterations is "too" large. (I have not looked into the code.)

Code on this gist. Graphs attached.

[figures 1-2: fbpca comparison plots]

@giorgiop

Contributor

giorgiop commented Aug 31, 2015

I wonder if we should change the default value for RandomizedPCA and TruncatedSVD from 3 to 2 as well:

What do people think about this? @ogrisel @larsmans I am happy to have a look into it.

@ogrisel

Member

ogrisel commented Aug 31, 2015

Sklearn seems slower. Performance is similar, but sklearn is slightly better here.

I don't understand this, could you please rephrase?

@ogrisel

Member

ogrisel commented Aug 31, 2015

Also could you please annotate the scatter plot with the number of power iterations for each dot?

See for instance: http://stackoverflow.com/questions/5147112/matplotlib-how-to-put-individual-tags-for-a-scatter-plot (no need for arrows)

@giorgiop

Contributor

giorgiop commented Aug 31, 2015

Just commenting on the two plots. Sklearn runs slower than fbpca, but its performance in terms of matrix factorization error is slightly better than fbpca's --until sklearn starts to degrade, for lfw_people.

Also could you please annotate the scatter plot with the number of power iterations for each dot?

Sure.

@ogrisel

Member

ogrisel commented Aug 31, 2015

The difference is how the power iterations are computed, see:

https://github.com/facebook/fbpca/blob/master/fbpca.py#L1562
vs
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L226

fbpca applies an LU factorization, lu(Q, permute_l=True), before each power iteration.

It would be worth checking the paper: maybe our code does not do it because of an oversight on our part. It seems to make the power iteration more stable: too many power iterations do not seem to cause a degradation in the quality of the approximation when the LU steps are there.
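A minimal sketch of that fbpca-style LU normalization between the multiplications (illustrative only; fbpca's actual code is at the link above, and the function name here is hypothetical):

```python
import numpy as np
from scipy.linalg import lu

def lu_normalized_power_iterations(A, Q, n_iter):
    """Power iterations where each multiplication by A or A.T is followed
    by a pivoted LU factorization (permute_l=True), whose permuted L
    factor is used as the new, better-conditioned basis."""
    for _ in range(n_iter):
        Q, _ = lu(A @ Q, permute_l=True)
        Q, _ = lu(A.T @ Q, permute_l=True)
    return Q  # (n_features, size); finish with a QR of A @ Q
```

LU is cheaper than a full QR per step, which may explain why the extra normalization does not make fbpca much slower while keeping many power iterations stable.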

@ogrisel

Member

ogrisel commented Aug 31, 2015

Maybe @ajtulloch would be interested in following this discussion :)

@giorgiop

Contributor

giorgiop commented Aug 31, 2015

Plots updated.

[figures 1-2: updated fbpca comparison plots]

@kastnerkyle

Member

kastnerkyle commented Aug 31, 2015

I think fbpca switches "modes" depending on the number of components - IIRC from looking at the code they switch to doing "full" pca and truncating if the number of components is >X% of the total. Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?


@amueller

Member

amueller commented Aug 31, 2015

and should we also switch modes internally?

@ogrisel

Member

ogrisel commented Sep 1, 2015

I think fbpca switches "modes" depending on the number of components - IIRC from looking at the code they switch to doing "full" pca and truncating if the number of components is >X% of the total. Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?

This is only the case when n_components is very close to min(n_samples, n_features). We could implement that strategy in sklearn as well (as a safety belt), but in practice I don't think it matters: on large enough data you can only do PCA with n_components << min(n_samples, n_features) (otherwise the SVD is too expensive).

@giorgiop

Contributor

giorgiop commented Sep 1, 2015

Looking into the code, the 3 differences I see are:

  • mode switching when n_comp > 4/5 min(n_samples, n_features).

Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?

I am comparing with the external interface pca, but I do not think there is any difference, except going through the branch where n_comp > 4/5 min(n_samples, n_features). That condition was always false in what I ran.

  • a LU factorization between every multiplication by A and A^T in the power iterations. This may be the reason why sklearn runs slightly faster. And it is very likely the reason for the increasing error after many power iterations, replying to @mblondel. From Halko et al.:

Unfortunately, when Algorithm 4.3 is executed in floating point arithmetic, rounding errors will extinguish all information pertaining to singular modes associated with singular values that are small compared with ||A||. (Roughly, if machine precision is μ, then all information associated with singular values smaller than μ^(1/(2q+1)) ||A|| is lost.)

The paper suggests QR factorizations anyway.

  • fbpca does not handle the cases m > n and m < n separately. I guess supporting complex input makes that less straightforward than in sklearn, which takes .T at the beginning.
@ogrisel

Member

ogrisel commented Sep 1, 2015

This may be the reason why sklearn runs slightly faster.

Actually on LFW, sklearn is slightly slower, not faster. Maybe the final QR is less expensive when the LU is done in the intermediate steps? It would be great to confirm with an experiment by adding an option to add the LU steps in our codebase.

And this is very likely the reason of the increasing error after many power iterations.

This looks very likely indeed.

@ogrisel

Member

ogrisel commented Sep 1, 2015

The implementation in dask.array.linalg by @marianotepper does vanilla power iterations (no LU):

https://github.com/blaze/dask/blob/master/dask/array/linalg.py#L225

@mblondel

Member

mblondel commented Sep 1, 2015

It would be interesting to compare QR and LU decompositions.

Another thing I am curious about is whether it would be worth using a stopping criterion for the loop in the range finder. Halko et al. suggest ||A - QQ^T A||. This can however be expensive to compute.
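That criterion is just the norm of the part of A that the current basis fails to capture; a direct (if expensive) sketch, with a hypothetical helper name:

```python
import numpy as np

def range_finder_error(A, Q):
    """Posterior error of a range finder with orthonormal basis Q:
    ||A - Q Q^T A||. Exact, but it costs two extra passes over A, which
    is why Halko et al. also discuss cheaper randomized estimates."""
    return np.linalg.norm(A - Q @ (Q.T @ A))
```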

@ajtulloch

Contributor

ajtulloch commented Sep 2, 2015

Cool, great discussion! LMK if we (Mark Tygert and I wrote fbpca) can contribute in any way. There's a fairly detailed evaluation at http://tygert.com/implement.pdf if that's useful as well.

@giorgiop

Contributor

giorgiop commented Sep 2, 2015

Thanks for the very useful reference @ajtulloch! I am going to make a benchmark file for playing with different versions of the power iterations. I will keep comparing with your fbpca.

@giorgiop giorgiop changed the title from [WIP] ENH: default # power iterations for randomized_svd to [WIP] ENH: optimizing power iterations phase for randomized_svd Sep 2, 2015

@amueller

Member

amueller commented Sep 2, 2015

thanks @ajtulloch :)
I wonder if we shouldn't also go the route of doing case-switching in PCA and removing RandomizedPCA. From a user perspective it is non-obvious when to use which.

@GaelVaroquaux

Member

GaelVaroquaux commented Sep 2, 2015

@amueller

Member

amueller commented Oct 9, 2015

To summarize the changes here: we went from 3 iterations by default to 2, and we now have an "auto" mode that applies LU normalization when more than 2 iterations are used, right?
So by default we are now slightly less accurate and somewhat faster.
Is that a fair summary? [should go to whatsnew]

@ogrisel

Member

ogrisel commented Oct 11, 2015

To summarize the changes here: we went from 3 iterations by default to 2, and we now have an "auto" mode that applies LU normalization when more than 2 iterations are used, right?
So by default we are now slightly less accurate and somewhat faster.

Yes and yes.

@ogrisel

Member

ogrisel commented Oct 11, 2015

We also do more oversampling than fbpca, so we tend to be more accurate than fbpca in general.

@ogrisel

Member

ogrisel commented Oct 12, 2015

Stupid question that maybe has been answered above: Why are we still slower than fbpca for example for 20 newsgroups for 0 and 1 iteration?

Because we use n_oversamples=10 by default which tends to increase the accuracy in general. fbpca uses the equivalent of n_oversamples=2 which can cause accuracy issues on some datasets.
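The effect of oversampling can be illustrated with a tiny experiment (hypothetical helper, not scikit-learn code): drawing the larger test matrix as an extension of the smaller one, the bigger sketch spans a superset of the smaller one's range, so its error can never be larger.

```python
import numpy as np

def one_shot_error(A, Omega):
    """Range-finder error ||A - Q Q^T A|| for a one-shot (n_iter=0)
    sketch built from the Gaussian test matrix Omega."""
    Q, _ = np.linalg.qr(A @ Omega)
    return np.linalg.norm(A - Q @ (Q.T @ A))

rng = np.random.RandomState(0)
# matrix with a decaying spectrum, rank well above the sketch size
A = rng.randn(100, 60) * np.linspace(1.0, 0.01, 60)

k = 10
Omega = rng.normal(size=(60, k + 10))
err_small = one_shot_error(A, Omega[:, :k + 2])  # like n_oversamples=2
err_large = one_shot_error(A, Omega)             # like n_oversamples=10
```

The flip side, as noted above, is that the larger sketch multiplies by a wider test matrix, which is where the extra running time goes.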

@lesteve

Member

lesteve commented Oct 12, 2015

To summarize the changes here, we went from 3 iterations by default to two iterations

Small remark, @amueller's comment only applies to RandomizedPCA. If you use sklearn.utils.randomized_svd directly as is the case in nilearn, you go from n_iter=0 to n_iter=2 by default.

@giorgiop

Contributor

giorgiop commented Oct 12, 2015

What's new:

  • randomized_svd default is now 2 power iterations instead of 0
  • randomized_range_finder default is to run normalized power iterations with LU, but only if n_iter > 2
  • PCA default is now 2 power iterations instead of 3

The comparison with fbpca is complex. fbpca runs with n_iter=2, LU normalization (always) and n_oversamples=2. There is another important difference that impacts runtime, which is the transpose mode discussed in #4478.

@giorgiop

Contributor

giorgiop commented Oct 12, 2015

Stupid question that maybe has been answered above: Why are we still slower than fbpca for example for 20 newsgroups for 0 and 1 iteration?

Because we use n_oversamples=10 by default which tends to increase the accuracy in general. fbpca uses the equivalent of n_oversamples=2 which can cause accuracy issues on some datasets.

It's true that we are slower, even with the same n_oversamples actually. The tests with 4 curves on individual datasets are run with n_oversamples=2 for all algorithms (bench_a in the code).

I have excluded any influence of the buggy transpose='auto' policy by always setting transpose=False in the tests. This means that where we would gain time by transposing, we are not: that is the case with fat matrices, such as 20newsgroups. fbpca implements the transposition strategy correctly; I believe this is the reason for the slowdown here.
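The transpose trick being discussed can be sketched as follows. This is a hypothetical 'auto' policy for illustration (the exact rule in sklearn, and the #4478 fix, may differ): for a fat matrix it is cheaper to run the range finder on M.T and swap the roles of U and V at the end.

```python
import numpy as np

def transpose_for_speed(M, transpose='auto'):
    """Decide whether to run the randomized SVD on M or on M.T.

    Working on the orientation with fewer rows keeps the projected
    matrix, and hence the deterministic SVD stage, small; the caller
    swaps U and V afterwards if `transpose` comes back True.
    """
    if transpose == 'auto':
        transpose = M.shape[0] < M.shape[1]  # fat matrix: work on M.T
    return (M.T if transpose else M), transpose
```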

@giorgiop

Contributor

giorgiop commented Oct 12, 2015

whats_new.rst updated.

@amueller

Member

amueller commented Oct 12, 2015

lgtm.

@kastnerkyle

Member

kastnerkyle commented Oct 20, 2015

This lgtm as well

@glouppe

Member

glouppe commented Oct 20, 2015

The what's new entries should be put under 0.18.

@giorgiop

Contributor

giorgiop commented Oct 20, 2015

The what's new entries should be put under 0.18.

Sure. I was hoping for a merge in 0.17 ;)
I will update. I am also working on speeding up the benchmarks with the computation of approximate spectral norms as discussed above.

Also, we should have a common default policy for the power iterations (n_iter and n_oversampling) with the other open PCA PR #5299, shouldn't we?

@giorgiop

Contributor

giorgiop commented Oct 20, 2015

whats_new updated, along with some other minor fixes.

I did not implement the faster spectral norm estimate. That measure depends on the random initialization, which makes benchmarking trickier (it would need many runs and averaged performance, etc.).

@giorgiop

Contributor

giorgiop commented Oct 21, 2015

Done here.

@kastnerkyle kastnerkyle changed the title from [MRG + 1] ENH: optimizing power iterations phase for randomized_svd to [MRG + 2] ENH: optimizing power iterations phase for randomized_svd Oct 21, 2015

@ogrisel

Member

ogrisel commented Oct 21, 2015

+1 for merging this now and re-adjusting the default params for consistency in the PCA collapse PR if needed.

ogrisel added a commit that referenced this pull request Oct 21, 2015

Merge pull request #5141 from giorgiop/power-iter-randomized-svd
[MRG + 2] ENH: optimizing power iterations phase for randomized_svd

@ogrisel ogrisel merged commit 0cb93b0 into scikit-learn:master Oct 21, 2015

1 of 2 checks passed:
  • continuous-integration/appveyor: waiting for AppVeyor build to complete
  • continuous-integration/travis-ci/pr: the Travis CI build passed
@ogrisel

Member

ogrisel commented Oct 21, 2015

🍋cello!

@kastnerkyle

Member

kastnerkyle commented Oct 21, 2015

🍻

@amueller

Member

amueller commented Oct 21, 2015

:craft beer:

In practice this is often enough for obtaining a good approximation of the
true eigenvalues/vectors in the presence of noise. By `Giorgio Patrini`_.
- :func:`randomized_range_finder` is more numerically stable when many

@amueller

amueller Oct 21, 2015

Member

can you please add a link to the PR to whatsnew? And versionadded hints for the changed parameters would be nice ;)

@kastnerkyle

kastnerkyle Oct 21, 2015

Member

You added this in the wrong diff @amueller

@kastnerkyle

kastnerkyle Oct 21, 2015

Member

ok, now I get it @amueller . just give me time :)

@giorgiop giorgiop deleted the giorgiop:power-iter-randomized-svd branch Nov 3, 2015

@giorgiop giorgiop restored the giorgiop:power-iter-randomized-svd branch Feb 21, 2016

@tygert tygert referenced this pull request Jul 16, 2018

Closed

Distributed / Spark version #6
