[MRG+3] Collapsing PCA and RandomizedPCA #5299

Merged
1 commit merged into scikit-learn:master from giorgiop/pca+randomizedpca on Mar 10, 2016

Conversation

@giorgiop
Contributor

giorgiop commented Sep 23, 2015

Fixes #5243 and #3930.

to do

  • collapse the two classes
  • old tests passing
  • integrate svd_solver=='arpack'
  • benchmark the 3 solvers and establish the best auto policy
  • fix docstrings
  • uniform with IncrementalPCA by inheritance from _BasePCA; see #3285
  • flip the sign by default, without introducing a flip_sign param for controlling the behaviour
  • arpack backport updated
@ogrisel

Member

ogrisel commented Sep 23, 2015

I have the feeling that IncrementalPCA has very little code in common with the regular PCA / _BasePCA classes. IMO we should not use inheritance when there is no significant amount of code to reuse. The indirections introduced by inheritance can make the code harder to read, so it has to be justified by code reuse.

@kastnerkyle

Member

kastnerkyle commented Sep 23, 2015

_BasePCA was originally introduced at the same time as IncrementalPCA with
an eye toward refactoring PCA to share this common code - most of those
methods should hold for PCA, since they were taken from there originally.
If it is no longer the case that we want to try and unify the two (IncPCA
and PCA), it may be worth just putting _BasePCA back onto IncrementalPCA
and keeping the code separate.

On Wed, Sep 23, 2015 at 9:10 AM, Olivier Grisel notifications@github.com
wrote:

I have the feeling that IncrementalPCA has very little code in common
with the regular PCA / _BasePCA classes. IMO we should not try to use
inheritance when there is no significant amount of code to reuse. Using
inheritance can make the code harder to read, so it has to be justified
by code reuse.


Reply to this email directly or view it on GitHub
#5299 (comment)
.

@ogrisel

Member

ogrisel commented Sep 23, 2015

Ah ok sorry for the confusion. I thought it was about the old ProbabilisticPCA that was actually already collapsed into PCA.

@giorgiop

Contributor

giorgiop commented Sep 29, 2015

Inheritance from _BasePCA was smooth and is now in the branch.

@giorgiop

Contributor

giorgiop commented Sep 30, 2015

We have to decide what to do with this behaviour

Due to implementation subtleties of the Singular Value Decomposition (SVD),
which is used in this implementation, running it twice on the same matrix
can lead to principal components with signs flipped (change in direction).
For this reason, it is important to always use the same estimator object to
transform data in a consistent fashion.

This paragraph was in PCA's docstring. RandomizedPCA handled it by calling svd_flip inside randomized_svd. We can either call it for all the svd_solver methods or for none of them.

To note: I have been working on the performance of randomized_svd in #5141, and svd_flip can be expensive because of this operation.
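
For context, a minimal sketch of the sign ambiguity and of how svd_flip resolves it deterministically; the toy matrix is illustrative only:

import numpy as np
from scipy.linalg import svd
from sklearn.utils.extmath import svd_flip

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(20, 5))  # toy data, illustrative only

U, S, Vt = svd(X, full_matrices=False)
# (U, Vt) and (-U, -Vt) are equally valid factorizations, so the sign of each
# component depends on solver internals. svd_flip picks a deterministic sign
# (based on the entry with the largest absolute value), so repeated fits and
# different solvers return components with consistent signs.
U, Vt = svd_flip(U, Vt)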

return X_original
@deprecated("it will be removed in 0.19. Use PCA(svd_solver='randomized') "
"instead. The new implementation DOES NOT store"
"whithen components_. Apply transform to get them.")

@TomDLT

TomDLT Sep 30, 2015

Member

whiten

@giorgiop

giorgiop Sep 30, 2015

Contributor

Thanks.

@giorgiop

Contributor

giorgiop commented Sep 30, 2015

I agreed with @ogrisel to call svd_flip in every method. We chose not to expose a flip_sign param, so as to keep a lighter signature for the PCA constructor.

@giorgiop

Contributor

giorgiop commented Sep 30, 2015

Simple runtime benchmark script

import numpy as np
from sklearn.decomposition.pca import PCA
from time import time
import gc

rng = np.random.RandomState(42)
X = np.random.uniform(-1, 1, size=(3000, 3000))

solver_list = ['full', 'arpack', 'randomized']

def bench_pca(X, pca):
    gc.collect()
    t0 = time()
    pca.fit(X)
    return time() - t0

for n_comp in [10, 50, 100, 500, 1000, 2400, 2999]:
    print("\n n_components=%d" % n_comp)
    for solver in solver_list:

        print("svd_solver='%s'" % solver)
        duration = bench_pca(X, PCA(n_components=n_comp, svd_solver=solver))
        print("done in %0.3fs" % duration)

Output:

 n_components=10
svd_solver='full'
done in 17.718s
svd_solver='arpack'
done in 3.084s
svd_solver='randomized'
done in 0.588s

 n_components=50
svd_solver='full'
done in 19.997s
svd_solver='arpack'
done in 3.725s
svd_solver='randomized'
done in 1.232s

 n_components=100
svd_solver='full'
done in 18.813s
svd_solver='arpack'
done in 5.494s
svd_solver='randomized'
done in 0.611s

 n_components=500
svd_solver='full'
done in 18.712s
svd_solver='arpack'
done in 18.547s
svd_solver='randomized'
done in 2.030s

 n_components=1000
svd_solver='full'
done in 19.876s
svd_solver='arpack'
done in 63.463s
svd_solver='randomized'
done in 5.219s

 n_components=2400
svd_solver='full'
done in 19.190s
svd_solver='arpack'
done in 137.910s
svd_solver='randomized'
done in 19.167s

 n_components=2999
svd_solver='full'
done in 18.724s
svd_solver='arpack'
done in 124.373s
svd_solver='randomized'
done in 28.858s

This supports the chosen auto policy, at least with regard to runtime. We could run tests on accuracy if needed.

@giorgiop giorgiop changed the title from [WIP] Collapsing PCA and RandomizedPCA to [MRG] Collapsing PCA and RandomizedPCA Sep 30, 2015

@giorgiop

This comment has been minimized.

Contributor

giorgiop commented Sep 30, 2015

exact may be a better name for the full PCA. Also let me know if you have better ideas than arpack.

@@ -166,9 +193,9 @@ class PCA(BaseEstimator, TransformerMixin):
http://www.miketipping.com/papers/met-mppca.pdf. It is required to
computed the estimated data covariance and score samples.
Notes
References

@TomDLT

TomDLT Sep 30, 2015

Member

----- -> ----------

@ogrisel

Member

ogrisel commented Oct 2, 2015

instead of 'full' I would use 'lapack'.

def inverse_transform(self, X):
"""Transform data back to its original space, i.e.,
return an input X_original whose transform would be X
if type(n_components) == str:

@ogrisel

ogrisel Oct 2, 2015

Member

Better test:

if isinstance(n_components, six.string_types):
random_state : int or RandomState instance or None (default None)
Pseudo Random Number generator seed control. If None, use the
numpy.random singleton.

@ogrisel

ogrisel Oct 2, 2015

Member

This is only used by the randomized and arpack solvers.

@giorgiop

giorgiop Oct 3, 2015

Contributor

Right. I have also added random seeding for arpack, which I forgot before.

@ogrisel

Member

ogrisel commented Oct 2, 2015

instead of 'full' I would use 'lapack'.

In retrospect, "full" is a good name but it would be worth mentioning that this is using the standard SVD solver from LAPACK in the docstring.

# Handle svd_solver
svd_solver = self.svd_solver
if svd_solver == 'auto':
if n_components < .8 * min(X.shape):

@ogrisel

ogrisel Oct 2, 2015

Member

It should be:

if n_components >= 1 and n_components < .8 * min(X.shape):

instead. n_components in [0, 1] means using the explained variance ratio to select the actual number of components once a full SVD has been performed.

@giorgiop

giorgiop Oct 3, 2015

Contributor

Right
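
For illustration, a hedged sketch of the 'auto' dispatch being discussed; the function name and exact thresholds are illustrative, not necessarily the merged code:

def _choose_solver(n_components, n_samples, n_features):
    """Illustrative 'auto' policy sketch; not the exact merged implementation."""
    small_dim = min(n_samples, n_features)
    if 0 < n_components < 1:
        # A float in (0, 1) means "keep enough components to explain this
        # fraction of the variance", which needs the full spectrum first.
        return 'full'
    if 1 <= n_components < .8 * small_dim:
        # Truncated solvers are only worthwhile well below the full rank.
        return 'randomized'
    return 'full'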

@@ -251,50 +303,68 @@ def fit_transform(self, X, y=None):
return U
def _fit(self, X):
"""Fit the model on X
"""Dispatch the actual fitting to _fit_full and _fit_truncated, after
handling svd_solver='auto' policy.

@ogrisel

ogrisel Oct 2, 2015

Member

PEP 257:

def _fit(self, X):
    """Dispatch to the right submethod depending on the choice of solver.

    The svd_solver='auto' case is expanded depending on the value of n_components.
    """
@giorgiop

Contributor

giorgiop commented Oct 3, 2015

Travis now complains while running with scipy 0.9, which did not have a v0 param in svds.

@ogrisel

Member

ogrisel commented Oct 4, 2015

Please use the backport from sklearn.utils.arpack.svds.

@ogrisel

Member

ogrisel commented Oct 5, 2015

Actually our backport of arpack.py is old and needs to be synchronized with the current version in scipy (e.g. 0.16.0) to get v0 init for svds.

Also, the way we do v0 init in sklearn is not good; see #5012.

@giorgiop

Contributor

giorgiop commented Oct 5, 2015

It should be done as v0 = random_state.uniform(-1, 1, size=min(X.shape)), which is what ARPACK does.
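
For illustration, a minimal sketch of seeding the ARPACK starting vector this way (toy matrix; assumes scipy >= 0.12 for the v0 parameter):

import numpy as np
from scipy.sparse.linalg import svds

random_state = np.random.RandomState(42)
X = random_state.uniform(-1, 1, size=(100, 30))  # toy data

# Seed ARPACK's starting vector explicitly so repeated runs are reproducible.
v0 = random_state.uniform(-1, 1, size=min(X.shape))
U, S, Vt = svds(X, k=10, v0=v0)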

@ogrisel

Member

ogrisel commented Oct 6, 2015

You should probably create a conda env with numpy=1.6.2 and scipy=0.11.0 to be able to test the backport on your local machine instead of suffering the latency of Travis builds.

@giorgiop

Contributor

giorgiop commented Oct 6, 2015

I have updated the backport. The problematic differences are the v0 param of svds (changed in 0.12) and the output of arpack's wrapper scipy.sparse.linalg.eigen.arpack._arpack (changed in 0.11). I tried to avoid code duplication.

I have also tested scipy=0.10 with python=2.7 locally, which is not part of Travis' tested environments.
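
For illustration, a hedged sketch of the kind of version gating this involves; the real backport in sklearn.utils.arpack differs, and the helper name here is hypothetical:

import scipy
from scipy.sparse.linalg import svds


def _svds_maybe_seeded(X, k, v0=None):
    """Pass v0 only when the installed scipy supports it (>= 0.12)."""
    scipy_version = tuple(int(p) for p in scipy.__version__.split('.')[:2])
    if v0 is not None and scipy_version >= (0, 12):
        return svds(X, k=k, v0=v0)
    # Older scipy: fall back to ARPACK's own (non-seedable) initialization.
    return svds(X, k=k)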

@giorgiop

Contributor

giorgiop commented Oct 8, 2015

Benchmark on explained variance ratio, relative to full PCA.

import numpy as np
from sklearn.decomposition.pca import PCA

rng = np.random.RandomState(42)
dim = 3000
s = 10 * np.exp(- .5 * np.arange(dim))  # eigenvalues
X = np.diag(s) 
X += .1 * np.random.uniform(-1, 1, size=(dim, dim))

solver_list = ['full', 'arpack', 'randomized']

for n_comp in [10, 50, 100, 500, 1000, 2400, 2999]:
    print("\nn_components=%d" % n_comp)
    for solver in solver_list:
        print("svd_solver='%s'" % solver)
        pca = PCA(n_components=n_comp, svd_solver=solver)
        expl_var = pca.fit(X).explained_variance_ratio_.sum()
        print("percent of explained variance %.6f" % expl_var)
@giorgiop

Contributor

giorgiop commented Oct 8, 2015

n_components=10
svd_solver='full'
percent of explained variance 0.016484
svd_solver='arpack'
percent of explained variance 0.016484
svd_solver='randomized'
percent of explained variance 0.014992

n_components=50
svd_solver='full'
percent of explained variance 0.065607
svd_solver='arpack'
percent of explained variance 0.065607
svd_solver='randomized'
percent of explained variance 0.058819

n_components=100
svd_solver='full'
percent of explained variance 0.122431
svd_solver='arpack'
percent of explained variance 0.122431
svd_solver='randomized'
percent of explained variance 0.111168

n_components=500
svd_solver='full'
percent of explained variance 0.471458
svd_solver='arpack'
percent of explained variance 0.471458
svd_solver='randomized'
percent of explained variance 0.453004

n_components=1000
svd_solver='full'
percent of explained variance 0.741764
svd_solver='arpack'
percent of explained variance 0.741764
svd_solver='randomized'
percent of explained variance 0.730655

n_components=2400
svd_solver='full'
percent of explained variance 0.993442
svd_solver='arpack'
percent of explained variance 0.993442
svd_solver='randomized'
percent of explained variance 0.993160

n_components=2999
svd_solver='full'
percent of explained variance 1.000000
svd_solver='arpack'
percent of explained variance 1.000000
svd_solver='randomized'
percent of explained variance 1.000000

full and arpack are perfectly aligned. randomized loses a lot when the number of components is small. For example, with only n_components=10 we lose about 9%: (0.016484 - 0.014992) / 0.016484 ≈ 0.09.

@giorgiop

Contributor

giorgiop commented Oct 8, 2015

I believe a good trade-off between runtime and explained variance for switching to arpack is around 5%. I mean to implement a rule for the 'auto' param like

if n_components < .05 * min(n_samples, n_features):
    svd_solver = 'arpack'
@giorgiop

Contributor

giorgiop commented Oct 9, 2015

I mean to implement a rule for the 'auto' param like
if n_components < .05 * min(n_samples, n_features):
svd_solver = 'arpack'

That's done.

This PR is complete from my side. I understand it is quite complex; I have put several things in it, all listed in the top post. Besides some enhancements and maintenance, I also fixed bug #3930, which has the 0.17 milestone. Please let me know if I can assist anyone with the review process.

@amueller

Member

amueller commented Oct 9, 2015

This is pretty awesome :) Thanks!

@amueller

Member

amueller commented Oct 9, 2015

Shouldn't svd_solver default to 'full' via None to be backward compatible? I guess it depends a bit on how far we are from previous results. We will change the default behavior whenever "auto" now selects randomized, right?

Can you please provide a comparison of the behavior in master and the new default behavior with auto, maybe as a scatter plot as you did for the comparison with fbpca?

@lesteve

Member

lesteve commented Mar 10, 2016

I'll merge this once we sort out the Travis failures.

I opened a numpy issue for the scipy-dev-wheels failure: numpy/numpy#7403. Not sure about the Python 3.5 failure, maybe a temporary glitch?

@lesteve

Member

lesteve commented Mar 10, 2016

Not sure about the Python 3.5 failure, maybe a temporary glitch?

In particular this doesn't really make sense:

Vendor:  Continuum Analytics, Inc.

Package: mkl

Message: trial mode expires in 30 days

since now mkl is available by default and doesn't require a licence ...

@lesteve

Member

lesteve commented Mar 10, 2016

The AppVeyor failure for 32bit Python looks genuine AFAICT:

======================================================================
FAIL: sklearn.decomposition.tests.test_pca.test_svd_solver_auto
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\tests\test_pca.py", line 490, in test_svd_solver_auto
    assert np.all(pca.components_ == pca_test.components_)
AssertionError
@MechCoder

Member

MechCoder commented Mar 10, 2016

@giorgiop Better to use np.allclose or assert_array_almost_equal then :P

@MechCoder

Member

MechCoder commented Mar 10, 2016

@giorgiop Can you rebase over master?

@giorgiop

Contributor

giorgiop commented Mar 10, 2016

@giorgiop Better to use np.allclose or assert_array_almost_equal then :P

Ahah! Reverting ...
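
For illustration, a minimal sketch of the tolerant comparison suggested above; the data and PCA settings are illustrative, not the actual test:

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 20))  # toy data

pca = PCA(n_components=5, svd_solver='full').fit(X)
pca_test = PCA(n_components=5, svd_solver='full').fit(X)

# Exact equality (==) is brittle across platforms, BLAS builds and word sizes;
# compare with a tolerance instead.
assert_array_almost_equal(pca.components_, pca_test.components_)
# np.allclose(pca.components_, pca_test.components_) is an alternative.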

MechCoder added a commit that referenced this pull request Mar 10, 2016

Merge pull request #5299 from giorgiop/pca+randomizedpca
[MRG+3] Collapsing PCA and RandomizedPCA

@MechCoder MechCoder merged commit 25c931a into scikit-learn:master Mar 10, 2016

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@MechCoder

Member

MechCoder commented Mar 10, 2016

Merged !!!!

@giorgiop

Contributor

giorgiop commented Mar 10, 2016

Fantastic. Great team effort!

@amueller

Member

amueller commented Jul 21, 2016

Hm, by making "auto" the default, this was a backward-incompatible change, right? It's a great change, but looking back, we are usually more careful, right?

@giorgiop

Contributor

giorgiop commented Jul 22, 2016

You are right. We hadn't thought of that at the time!

@cbrnr

Contributor

cbrnr commented Aug 18, 2016

Also, could it be that the resulting components can now have a different sign because svd_flip is now always called? This also leads to different results as compared to 0.17.1...

@cbrnr

Contributor

cbrnr commented Aug 19, 2016

No, I meant that with the introduction of svd_flip, the resulting PCA components will be different (in general) as compared to the previous version in 0.17.1. For example (run that with 0.17.1 and then with the current master):

import numpy as np
import sklearn
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


np.random.seed(1)
x = np.random.rand(10, 5000)
x[::2] += 0.5 * x[::2] + 2 * np.random.rand(5, 5000)
x[-1, :] *= 10

pca = PCA()
pca.fit(x.T)
plt.imshow(pca.components_, interpolation="nearest")
plt.title("pca.components_ sklearn %s" % sklearn.__version__)
plt.show()

The results are as follows (screenshots of pca.components_ with sklearn 0.17.1 and with sklearn master):

Apparently, rows 4, 5, and 9 have been flipped in the new PCA version. I think it is really bad when results suddenly change without prior notice, especially for such a run-of-the-mill method as PCA.

@amueller

Member

amueller commented Aug 25, 2016

It is a bit unfortunate, but I think it will be hard for the results to stay exactly the same. @cbrnr can you point to where svd_flip is now called where it wasn't before? @giorgiop why was that done? For consistency between the different solvers? If a different solver is chosen, we cannot really expect the exact same result, though a close result would be good.
Still, calling the sign flip more often at least allows some kind of consistent behavior.

@cbrnr I would argue that relying on the signs of eigenvectors is a really bad idea, and if you expect them to stay the same, you run into trouble either way. Without the sign flip, the outcome is random anyhow (determined by minor choices in the SVD solver). Why did this come up?

[I totally got bitten by this when writing my book btw]
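
For illustration, a hedged sketch of comparing components in a sign-invariant way instead of relying on raw signs; the data is illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = rng.rand(5000, 10)  # illustrative data

comp_a = PCA(n_components=3).fit(X).components_
comp_b = PCA(n_components=3).fit(X).components_  # e.g. another run or version

# Flip any component whose orientation disagrees before comparing, so the
# comparison is invariant to the arbitrary sign of each eigenvector.
signs = np.sign(np.sum(comp_a * comp_b, axis=1))
assert np.allclose(comp_a, comp_b * signs[:, np.newaxis])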

@cbrnr

Contributor

cbrnr commented Aug 28, 2016

@amueller I'm not saying that the current solution using svd_flip is not a good one. In fact, I very much appreciate that there is a deterministic rule for how the sign is determined. I'm just saying that it is not good practice for the results of a standard algorithm to suddenly change without prior notice, that's all (I spent some time figuring out why I suddenly got different results even though I hadn't changed anything in my code). Anyway, maybe a warning or a note in the next release would be helpful (preferably one that is displayed when using PCA).

@cbrnr

Contributor

cbrnr commented Aug 28, 2016

Yes, you're right, in this case the results were indeed undetermined. I'll work on the PR explaining the problem in the docs.

@amueller

Member

amueller commented Aug 28, 2016

Maybe a note in the API changes section (maybe that section needs a better title; I feel it should be "things you need to be careful about when updating", only shorter ;)

@amueller

Member

amueller commented Aug 28, 2016

@cbrnr, and I think we all agree that unexpected changes are always bad. We try our best to avoid them.

@cbrnr

Contributor

cbrnr commented Aug 28, 2016

@amueller Maybe "Important changes" instead of API changes? Anyway, I'll add something to this section. Also, I didn't want to complain. Of course such (and much worse) things happen and I'm happy to contribute!

@@ -351,6 +355,12 @@ def randomized_svd(M, n_components, n_oversamples=10, n_iter=2,
# this implementation is a bit faster with smaller shape[1]
M = M.T
# Adjust n_iter. 7 was found a good compromise for PCA. See #5299
if n_components < .1 * min(M.shape) and n_iter < 7:

@amueller

amueller Aug 31, 2016

Member

This behavior is really strange. I'm trying to fix it in #7311.
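
For illustration, a hedged sketch of the accuracy/cost trade-off behind that n_iter heuristic; the data and n_iter values are illustrative:

import numpy as np
from scipy.linalg import svd
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(1000, 1000))  # toy data

_, s_exact, _ = svd(X, full_matrices=False)
for n_iter in (2, 7):
    # More power iterations improve accuracy for small n_components,
    # at the cost of extra passes over the data.
    _, s_approx, _ = randomized_svd(X, n_components=10, n_iter=n_iter,
                                    random_state=0)
    err = np.max(np.abs(s_approx - s_exact[:10]))
    print("n_iter=%d  max abs error on singular values: %.4f" % (n_iter, err))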
