[MRG+2] Incremental PCA #3285

Merged
Merged 2 commits from kastnerkyle:incremental_pca into scikit-learn:master on Sep 23, 2014

Conversation

8 participants
Owner

kastnerkyle commented Jun 17, 2014

An implementation of Incremental PCA. See Incremental Learning for Visual Tracking, J. Lim, D. Ross, R. Lin, M. Yang, 2004 for more details.

The key idea behind this PR is to have an out-of-core PCA that gives roughly equivalent results while using constant memory (on the order of the batch_size rather than n_samples).
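The batching pattern behind this can be sketched in a few lines of numpy. Note that iter_batches here is a hypothetical helper written for illustration (scikit-learn provides sklearn.utils.gen_batches for this purpose); the point is only that one batch at a time is materialized.

```python
import numpy as np

def iter_batches(n_samples, batch_size):
    """Yield index slices covering [0, n_samples) in order."""
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))

# Only one batch is materialized at a time, so peak working memory is
# on the order of batch_size * n_features, not n_samples * n_features.
X = np.random.RandomState(0).randn(1000, 20)
for batch in iter_batches(X.shape[0], 100):
    chunk = X[batch]  # this is what a partial_fit call would consume
    assert chunk.shape == (100, 20)
```

In the real estimator each chunk would be fed to partial_fit instead of being discarded.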

Coverage Status

Coverage increased (+0.02%) when pulling 9f3d149 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

sklearn/decomposition/incremental_pca.py
+ >>> ipca = IncrementalPCA(n_components=2)
+ >>> ipca.fit(X)
+ IncrementalPCA(batch_size=10, copy=True, n_components=2)
+ >>> print(np.linalg.norm(ipca.components_, axis=0)) # doctest: +ELLIPSIS
@eickenberg

eickenberg Jun 20, 2014

Contributor

Older versions of numpy don't understand the axis argument in np.linalg.norm. One of the travis builds is failing due to this.
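For reference, a version-portable equivalent of the column norms that avoids the axis argument of np.linalg.norm entirely:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)

# np.linalg.norm(X, axis=0) needs numpy >= 1.8; this equivalent works
# on older numpy versions as well.
col_norms = np.sqrt(np.sum(X ** 2, axis=0))

# Same result as computing the norm one column at a time.
assert np.allclose(col_norms,
                   [np.linalg.norm(X[:, j]) for j in range(X.shape[1])])
```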

sklearn/decomposition/incremental_pca.py
+ old_proj = old_S[:, np.newaxis] * old_components
+
+ combined_data = np.vstack((old_proj, X, append_vals))
+ Q, R = linalg.qr(combined_data, mode='economic')
@eickenberg

eickenberg Jun 20, 2014

Contributor

In this row-based version of the code, is the QR decomposition necessary again? As far as I read, it may not be. (If you don't do it, Q should be absorbed in U and S, V should stay the same.) If I am wrong, can you remind me what the issue was?
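A small numeric check of this observation (a numpy-only sketch, not the PR's code path): doing QR first and then an SVD of the small R factor leaves the singular values and right singular vectors unchanged, with Q absorbed into U.

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(50, 8)

# Direct SVD of the stacked data...
U_d, S_direct, V_direct = np.linalg.svd(A, full_matrices=False)

# ...versus QR first, then SVD of the small R factor. Q is absorbed
# into U, while S and V are unchanged, so skipping the QR step should
# only change U.
Q, R = np.linalg.qr(A)
U_r, S_qr, V_qr = np.linalg.svd(R, full_matrices=False)

assert np.allclose(S_direct, S_qr)
# right singular vectors agree up to sign
assert np.allclose(np.abs(V_direct), np.abs(V_qr))
# and Q @ U_r together with S_qr, V_qr reconstructs A exactly
assert np.allclose(Q.dot(U_r).dot(np.diag(S_qr)).dot(V_qr), A)
```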

Contributor

eickenberg commented Jun 20, 2014

This looks really nice!

Coverage Status

Coverage increased (+0.02%) when pulling 5c5197f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

Coverage Status

Coverage increased (+0.04%) when pulling adc28e7 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

Owner

kastnerkyle commented Jun 25, 2014

[Plots: reconstruction error vs. batch_size at n_components = 45, 180, 315, and 450; error vs. n_components at batch_size = 1000; runtime vs. batch_size at n_components = 45, 180, 315, and 450; runtime vs. n_components at batch_size = 1000]

These plots show IncrementalPCA in comparison to PCA and RandomizedPCA. Note that for the error vs. batch_size tests, RandomizedPCA was not shown because it is significantly worse in all tested cases. This makes sense - random projections should be worse than projections estimated from the data (in general). The PCA lines in error and time vs. batch_size are constant because PCA has no batch_size, but it makes a good reference to compare against.

Note that the "jagged plot" has a y scale of 1E-8... I am pretty sure this is just small numerical differences - the average is consistent with PCA across batch size (and the variation is +-5e-8, which seems OK to me).

@kastnerkyle kastnerkyle changed the title from [WIP] Incremental PCA to [MRG] Incremental PCA Jun 26, 2014

Owner

kastnerkyle commented Jun 26, 2014

Additionally, it would be very nice to add whitening and explained variance to this estimator. I have a few leads on how to do this, but I cannot find any academic papers about incremental estimation of these parameters. If I can get explained variance, then whitening follows naturally. If anyone has any ideas I am all ears.

An added advantage is that if these parameters can be estimated, then PCA and IncrementalPCA could be merged since batch_size=n_samples processing is exactly the same as a "regular" PCA. This may or may not be a good option, but it is something to think about.

Contributor

eickenberg commented Jun 26, 2014

Explained variance is nothing other than rescaled squared singular values. You have access to those, right?


Owner

kastnerkyle commented Jun 26, 2014

I have access to the first ones separately, then everything after that is kind of "joint" in that you only get U, S, V = svd((old_components, X, mean_adjust)) instead of just U, S, V = svd(X) like in the first pass through.

You could do 2 SVDs for every pass through partial_fit but it seems pretty wasteful just for explained variance - and I feel there must be a way to compute it directly (or at least reverse it out somehow, since we know the first values and a combined version of all the ones after)!

Maybe using the old S (S for the next to last batch) in conjunction with the final S somehow?

Another approach would be to take the ideas from this StackExchange link and try to use them somehow...

Contributor

eickenberg commented Jun 26, 2014

OK, so the S you have access to does not correspond to the global S of the data seen until then. I would have hoped that by "chance" it does, and that the differences of the matrix you are working with w.r.t. the full matrix are again "by chance" accounted for by a change in U (since V is definitely the same, right?)

I do not know of a formula using old_S and S, but will keep the problem in my head :)


Owner

agramfort commented Jun 26, 2014

Total variance is np.sum(X * X), which you can calculate by accumulation. Explained variance will be

(np.sum(eigvals) - np.sum(X * X)) / np.sum(X * X)

or something like this. Clear?
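The accumulation part of this suggestion is easy to verify with a numpy-only sketch (the batch size of 100 here is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(1000, 5)

# Accumulate the total sum of squares one batch at a time; no more
# than one batch ever needs to be held in memory.
total_ss = 0.0
for start in range(0, X.shape[0], 100):
    batch = X[start:start + 100]
    total_ss += np.sum(batch * batch)

assert np.allclose(total_ss, np.sum(X * X))
```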

Owner

kastnerkyle commented Jun 26, 2014

The total variance calculation is clear, but the eigvals part is not - the eigvals that fall out of the current calculation come from svd(vstack((previous_components, X, mean_adjust))), rather than just X. I am trying to run a test now to see if the eigvals at the end of several partial_fit calls are approximately equivalent to the eigvals from PCA, but I don't know yet. If this is the case, then I think the rule from the PCA block would work directly - basically calculate (S**2) / n_samples_seen, or something similar. If not...

Owner

kastnerkyle commented Jun 27, 2014

Initial investigations indicate that S ** 2 is still good for calculating explained_variance_ and explained_variance_ratio_ . Working on a commit now with calculations and unit tests, and will post some plots when that is done. This should mean that whitening can be added as well.
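The S ** 2 relationship can be checked on a single full batch with plain numpy (the incremental case stacks more terms into the SVD, as discussed above):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(500, 4)
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# The variance captured by each principal axis is the (biased) variance
# of the data projected onto that axis, which is exactly S**2 / n_samples.
explained_variance = S ** 2 / Xc.shape[0]
projected = Xc.dot(Vt.T)

assert np.allclose(explained_variance, projected.var(axis=0))
```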

I guess now is a good time to start the discussion about inverse_transform whitening - I think it is preferable to invert the whitening as well, even though PCA doesn't currently do this. Because there is no prior usage of IncrementalPCA module, we would be starting off "correctly" IMO without the baggage/possible deprecation that may have to happen in PCA. See the discussion at #3107 for more details.

Contributor

eickenberg commented Jun 27, 2014

Nice! For incremental SVD I am almost sure I now have a proof that S is always the correct S, just as V is always the correct V (for a full decomposition); at the very least I may have found a way to show that they are close. For incremental PCA one would additionally need to fold the mean-adjustment considerations into this, just like in the cited paper. It could be straightforward, but I haven't looked at it.

Owner

kastnerkyle commented Jun 27, 2014

[Plots: explained_variance_ and explained_variance_ratio_ for IncrementalPCA vs. PCA]

Comparison of IncrementalPCA vs. PCA explained_variance_ and explained_variance_ratio_. I am still skeptical of the explained_variance_ratio_ - not sure if I should track the overall sum as is currently happening, or only track the sum through [:n_components] ...

You lose information due to cutting components at each partial_fit, but just like the other plots, the closer your batch_size gets to n_samples and the more components are kept, the closer the estimate gets to the PCA result.

If this seems reasonable, I will work on adding whitening (as well as inverse transform unwhitening) and some tests for these extras (probably just comparing to PCA result to make sure they are close)

Generating script:

import numpy as np
from sklearn.decomposition import IncrementalPCA, PCA
from sklearn.datasets import make_low_rank_matrix
import matplotlib.pyplot as plt


def run_experiment(X, n_components_tests, batch_size_tests, ratio=False):
    all_err = []
    for n in n_components_tests:
        print("n_components %i" % n)
        pca = PCA(n_components=n)
        pca.fit(X)
        err = []
        for i in batch_size_tests:
            ipca = IncrementalPCA(n_components=n, batch_size=i)
            ipca.fit(X)
            if not ratio:
                print("Explained variance error, batch_size %i" % i)
                mae = np.mean(np.abs(ipca.explained_variance_ -
                                     pca.explained_variance_))
            else:
                print("Explained variance ratio error, batch_size %i" % i)
                mae = np.mean(np.abs(ipca.explained_variance_ratio_ -
                                     pca.explained_variance_ratio_))
            err.append(mae)
        all_err.append(err)
    return np.array(all_err)

X = make_low_rank_matrix(n_samples=10000, random_state=1999)
n_components_tests = np.array([2, 5, 10, 25, 50, 75, 100])
batch_size_tests = np.array([1000, 2500, 5000, 7500, 10000])
all_var_err = run_experiment(X, n_components_tests, batch_size_tests)
for n, i in enumerate(n_components_tests):
    plt.plot(batch_size_tests, all_var_err[n, :], label="n_components=%i" % i)
plt.xlabel("Batch size")
plt.ylabel("Explained variance")
plt.title("Mean absolute error between IncrementalPCA and PCA")
plt.legend(loc="upper right")
plt.figure()
all_ratio_err = run_experiment(X, n_components_tests, batch_size_tests,
                               ratio=True)
for n, i in enumerate(n_components_tests):
    plt.plot(batch_size_tests, all_ratio_err[n, :], label="n_components=%i" % i)
plt.xlabel("Batch size")
plt.ylabel("Explained variance ratio")
plt.title("Mean absolute error between IncrementalPCA and PCA")
plt.legend(loc="upper right")
plt.show()

@kastnerkyle kastnerkyle changed the title from [MRG] Incremental PCA to [WIP] Incremental PCA Jun 27, 2014

Coverage Status

Coverage increased (+0.07%) when pulling 65cd21f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

Owner

kastnerkyle commented Jun 30, 2014

[Plot: updated explained_variance_ratio_ comparison]

Updated the explained variance ratio calculation. It seems closer to correct (errors are close to 0 for n_components=n_features, and otherwise similar to the previous calculation), but the "bump" in the chart leads me to believe that some kind of compensation has to happen when batches are different sizes. Around 7500, where the bump is worst, there are 2 batches - one of 5000 and one of 2500. Ideally, this plot should converge toward 0 as batch size increases.

Coverage Status

Coverage increased (+0.07%) when pulling b2af89f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

Owner

kastnerkyle commented Jun 30, 2014

[Plot: explained_variance_ratio_ after batch-size weighting]

After weighting with the ratio of the size of the batch vs. the total number of samples seen so far, this estimate looks closer to what I would expect. However, after talking with @ogrisel I am going to try using an online variance estimate (such as discussed here) and see if that gets a more accurate estimate of the ratio - .05 out of a total of 1.0 is still 5% error...

Coverage Status

Coverage increased (+0.07%) when pulling 5d9b4c1 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

Owner

kastnerkyle commented Jul 1, 2014

[Plots: explained_variance_ and explained_variance_ratio_ after incremental variance tracking]

After tracking the explained_variance_ sum incrementally (using a batch Youngs and Cramer method, found in this paper), things look much better. Now all that is left is to add whitening and noise variance estimates as well as tests. Special thanks to @eickenberg and @ogrisel for discussions about this.
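The batch mean/variance merge underlying this kind of update can be sketched as follows. combine_mean_var is a hypothetical helper name (not the one used in the PR); the formula is the standard pairwise Chan / Youngs-and-Cramer update.

```python
import numpy as np

def combine_mean_var(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge per-feature (count, mean, sum of squared deviations)
    statistics of two batches in one shot."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

rng = np.random.RandomState(0)
A = rng.randn(300, 3)
B = rng.randn(200, 3)

n, mean, m2 = combine_mean_var(
    len(A), A.mean(axis=0), ((A - A.mean(axis=0)) ** 2).sum(axis=0),
    len(B), B.mean(axis=0), ((B - B.mean(axis=0)) ** 2).sum(axis=0))

# Merged statistics match those computed on the concatenated data.
X = np.vstack((A, B))
assert np.allclose(mean, X.mean(axis=0))
assert np.allclose(m2 / n, X.var(axis=0))
```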

Coverage Status

Coverage increased (+0.02%) when pulling e0cb293 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

Coverage Status

Coverage increased (+0.03%) when pulling 500a298 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle kastnerkyle changed the title from [WIP] Incremental PCA to [MRG] Incremental PCA Jul 2, 2014

Owner

kastnerkyle commented Jul 2, 2014

I think this is ready for another review/look by other people

@kastnerkyle kastnerkyle changed the title from [MRG] Incremental PCA to [WIP] Incremental PCA Jul 4, 2014

Owner

kastnerkyle commented Jul 4, 2014

@ogrisel and I talked some about factoring some common methods between PCA, RandomizedPCA, KernelPCA, and others (get_precision, get_covariance, transform, inverse_transform) into a BasePCA class, but for now I just re-used the calculations from PCA. In the long term it might be something to think about and discuss.

Owner

agramfort commented Jul 4, 2014

+1

Coverage Status

Coverage increased (+0.03%) when pulling 966fa01 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

Owner

kastnerkyle commented Jul 4, 2014

Thinking more about this, what about a PCA mixin that inherits from TransformerMixin and specifically has these methods (get_precision, get_covariance, possibly transform, inverse_transform)? That way, the classes that need those can bring in the mixin, and the others can just use TransformerMixin. I don't know that BasePCA makes as much sense - each of the PCA algorithms has differences in different areas, and a mixin might be a little nicer to include (or not) as needed.

Owner

agramfort commented Jul 4, 2014

I would start with BasePCA. If something becomes unnatural, +1 for a mixin.

Owner

ogrisel commented Jul 4, 2014

Same here. Name it _BasePCA to reflect that it's a base class and not part of the public API.

Owner

kastnerkyle commented Jul 10, 2014

OK - my preference would be to make IncrementalPCA use _BasePCA , then do separate PRs for refactoring each of the other PCA models. Does this sound like a good plan?

Owner

agramfort commented Jul 10, 2014

fine with me

Coverage Status

Coverage increased (+0.05%) when pulling e02126a on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

Owner

kastnerkyle commented Jul 10, 2014

_BasePCA now houses the common init parameters, get_covariance, get_precision, transform, and inverse_transform. These should be the same for almost all the PCA-based methods, though there may need to be modifications to accommodate the different limitations of the algorithms (Kernel, Randomized, standard PCA). IncrementalPCA now uses this common code.
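A rough sketch of the shape of such a base class (an illustration of the refactoring idea, not the merged _BasePCA code):

```python
import numpy as np

class _BasePCA(object):
    """Sketch only: subclasses are expected to set mean_ and
    components_ during fit / partial_fit; the projection methods
    shared by the PCA variants live in the base class."""

    def transform(self, X):
        return np.dot(X - self.mean_, self.components_.T)

    def inverse_transform(self, X):
        return np.dot(X, self.components_) + self.mean_

# Tiny usage demo with hand-filled attributes.
pca = _BasePCA()
pca.mean_ = np.zeros(3)
pca.components_ = np.eye(2, 3)  # keep the first two axes

X = np.array([[1.0, 2.0, 3.0]])
assert np.allclose(pca.transform(X), [[1.0, 2.0]])
assert np.allclose(pca.inverse_transform(pca.transform(X)),
                   [[1.0, 2.0, 0.0]])
```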

Owner

kastnerkyle commented Jul 10, 2014

After talking to @ogrisel, I removed the super().__init__ stuff. It seems cleaner to keep the __init__ args locally, and simply pull out the common functions.

Owner

kastnerkyle commented Jul 10, 2014

Another point for #3361 ... agh

Owner

ogrisel commented Jul 10, 2014

#3361 is merged, you can rebase on it if you wish.

Owner

kastnerkyle commented Jul 10, 2014

Will do - would like to see that green checkmark! Had rebase conflicts(?) but merge seemed OK...

Coverage Status

Coverage increased (+0.04%) when pulling f8f1f50 on kastnerkyle:incremental_pca into 48e8cd6 on scikit-learn:master.

Owner

kastnerkyle commented Jul 14, 2014

Did a little profiling, as I was unsure about the repeated vstack calls inside partial_fit. It looks like the vstack is not a problem; the time is largely dominated by svd. Also, line_profiler does not work with Python 3 yet :( . Overall, I think this is ready.

          384058 function calls (376709 primitive calls) in 2.219 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      250    0.532    0.002    0.569    0.002 decomp_svd.py:15(svd)
      213    0.103    0.000    0.288    0.001 doccer.py:12(docformat)
    865/1    0.074    0.000    2.224    2.224 {built-in method exec}
        1    0.074    0.074    0.074    0.074 {built-in method decompress}
    68874    0.059    0.000    0.059    0.000 {method 'append' of 'list' objects}
      407    0.057    0.000    0.057    0.000 {built-in method loads}
     3789    0.052    0.000    0.052    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      222    0.051    0.000    0.094    0.000 doccer.py:128(indentcount_lines)
    99/91    0.043    0.000    0.084    0.001 {built-in method load_dynamic}
     1276    0.040    0.000    0.161    0.000 <frozen importlib._bootstrap>:2016(find_spec)
39396/39148    0.034    0.000    0.034    0.000 {built-in method len}
      250    0.031    0.000    0.761    0.003 incremental_pca.py:197(partial_fit)

SNIP

  249    0.001    0.000    0.022    0.000 shape_base.py:179(vstack)

Commandline:

python -m cProfile -s tottime temp_bench_incremental.py | less

Script:

import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.datasets import fetch_lfw_people

faces = fetch_lfw_people(resize=.2, min_faces_per_person=5)
# limit dataset to the first 5000 images (don't care who they are!)
X = faces.data[:5000]
n_samples, h, w = faces.images.shape
n_features = X.shape[1]

X -= X.mean(0)
X /= np.sqrt(np.sum(X ** 2, axis=0))

ipca = IncrementalPCA(n_components=10, batch_size=20)
ipca.fit(X)

Coverage Status

Coverage increased (+0.01%) when pulling 2cc9f65 on kastnerkyle:incremental_pca into b65e4c8 on scikit-learn:master.

@kastnerkyle kastnerkyle changed the title from [WIP] Incremental PCA to [MRG] Incremental PCA Jul 15, 2014

benchmarks/bench_plot_incremental_pca.py
+from sklearn.decomposition import IncrementalPCA, RandomizedPCA, PCA
+from time import time
+import gc
+from sklearn.datasets import fetch_lfw_people
@agramfort

agramfort Jul 15, 2014

Owner

please reorder imports (numpy first and sklearn last)

+
+==========
+Incremental PCA
+==========
@agramfort

agramfort Jul 15, 2014

Owner

not enough ===

@eickenberg

eickenberg Jul 15, 2014

Contributor

I have spare === here if you need any

+y = iris.target
+target_names = iris.target_names
+
+cp = 2
@agramfort

agramfort Jul 15, 2014

Owner

cp -> n_components

+target_names = iris.target_names
+
+cp = 2
+ipca = IncrementalPCA(n_components=cp)
@agramfort

agramfort Jul 15, 2014

Owner

explicit batch_size to make the point

+X_r = ipca.fit(X).transform(X)
+
+pca = PCA(n_components=cp)
+X_r2 = pca.fit(X).transform(X)
@agramfort

agramfort Jul 15, 2014

Owner

use fit_transform method

sklearn/decomposition/incremental_pca.py
+
+ Attributes
+ ----------
+ `components_` : array, [n_components, n_features]
@agramfort

agramfort Jul 15, 2014

Owner

array, [n_components, n_features]

->

array, shape (n_components, n_features)

sklearn/decomposition/incremental_pca.py
+ `components_` : array, [n_components, n_features]
+ Components with maximum variance.
+
+ `explained_variance_` : array, [n_components]
@agramfort

agramfort Jul 15, 2014

Owner

same remark on doc formatting

sklearn/decomposition/incremental_pca.py
+
+ def fit(self, X, y=None):
+ """Fit the model with X. This will effectively call partial_fit on
+ the entire dataset, split into ordered minibatches of size batch_size.
@agramfort

agramfort Jul 15, 2014

Owner

to follow pep 257 first comment should fit on one line

sklearn/decomposition/incremental_pca.py
+ stored_sum, unnormalized_variance = _variance_update(
+ old_stored_sum, new_stored_sum, old_unnormalized_variance,
+ new_unnormalized_variance, old_sample_count, new_sample_count)
+ explained_variance = S ** 2 / (total_samples)
@agramfort

agramfort Jul 15, 2014

Owner

remove () around total_samples

sklearn/decomposition/pca.py
+ Returns
+ -------
+ X_new : array-like, shape (n_samples, n_components)
+
@agramfort

agramfort Jul 15, 2014

Owner

add a docstring line

+
+def test_incremental_pca_batch_signs():
+ """Test that components_ sign is stable over batch sizes."""
+ rs = np.random.RandomState(1999)
@agramfort

agramfort Jul 15, 2014

Owner

standard acronym is rng

+
+def test_incremental_pca_batch_values():
+ """Test that components_ values are stable over batch sizes."""
+ rs = np.random.RandomState(1999)
sklearn/decomposition/incremental_pca.py
+
+ Notes
+ -----
+
@arjoly

arjoly Aug 18, 2014

Owner

I would remove this empty line.

sklearn/decomposition/incremental_pca.py
+
+ References
+ ----------
+
@arjoly

arjoly Aug 18, 2014

Owner

I would remove this empty line.

+ X: array-like, shape (n_samples, n_features)
+ Training data, where n_samples is the number of samples and
+ n_features is the number of features.
+
@arjoly

arjoly Aug 18, 2014

Owner

y is not documented.

@eickenberg

eickenberg Aug 18, 2014

Contributor

y doesn't seem to be used anywhere. Is it actually necessary to have it in the signature of fit?

@arjoly

arjoly Aug 18, 2014

Owner

Yes, otherwise you don't understand why there is a y. Something like this could work

y : is not used: placeholder to allow for usage in a Pipeline.
@kastnerkyle

kastnerkyle Aug 18, 2014

Owner

Yes, I had it as a placeholder for pipeline use (transform as well). I will add a docstring about it being a placeholder for Pipeline compatibility.


@eickenberg

eickenberg Aug 18, 2014

Contributor

Ah yes, thanks for the clarification. As far as I can see it is not mentioned as such in PCA, though. PCA does not have y=None in transform, only in fit, which seems to be the way things are handled in general (side note: causing PLS, which transforms both sides, not to work with the pipeline. But that is a completely different matter :))

@kastnerkyle

kastnerkyle Sep 9, 2014

Owner

I commented this now, and removed it from other places where it didn't need to be.

sklearn/decomposition/incremental_pca.py
+ self.noise_variance_ = None
+ self.var_ = None
+ self.n_samples_seen_ = 0
+ X = check_array(X)
@arjoly

arjoly Aug 18, 2014

Owner

Should we set the dtype=np.float here?

+ X: array-like, shape (n_samples, n_features)
+ Training data, where n_samples is the number of samples and
+ n_features is the number of features.
+
@arjoly

arjoly Aug 18, 2014

Owner

y is not documented.

sklearn/decomposition/incremental_pca.py
+ "processing" % (n_components, n_features))
+
+ if self.components_ is not None:
+ if self.components_.shape[0] != n_components:
@arjoly

arjoly Aug 18, 2014

Owner

I would merge the two if.

sklearn/decomposition/incremental_pca.py
+ "to %i between calls to partial_fit! Try "
+ "setting n_components to a fixed value." % (
+ self.components_.shape[0], n_components))
+ self.n_components_ = n_components
@arjoly

arjoly Aug 18, 2014

Owner

This could be set directly at line 197

sklearn/decomposition/incremental_pca.py
+ mean_correction = np.sqrt((self.n_samples_seen_ * n_samples) /
+ n_total_samples) * (self.mean_ -
+ col_batch_mean)
+ combined_data = np.vstack((self.singular_values_.reshape((-1, 1)) *
@arjoly

arjoly Aug 18, 2014

Owner

combined_data => X_augmented / X_combined?

+ X_transformed = ipca.transform(X)
+ X_transformed2 = ipca.fit_transform(X)
+
+ assert_array_almost_equal(X_transformed, X_transformed2)
@arjoly

arjoly Aug 18, 2014

Owner

This looks like a redundant assertion.

@eickenberg

eickenberg Aug 18, 2014

Contributor

In the original sklearn.decomposition.PCA, fit_transform is implemented separately and the test there is to assure that .fit_transform(X) == .fit(X).transform(X). Here, in IncrementalPCA, fit_transform is handled by the TransformerMixin of _BasePCA, so by definition it does exactly what .fit(X).transform(X) does. So as @arjoly is saying, this assertion seems redundant for iPCA.

+
+
+def test_incremental_pca_num_features_change():
+ """Test that components_ sign is stable over batch sizes."""
@arjoly

arjoly Aug 18, 2014

Owner

The docstring is not accurate.

+ assert_almost_equal(np.abs(Y_pca), np.abs(Y_ipca), 1)
+
+
+def test_incremental_pca_against_pca_random_data():
@arjoly

arjoly Aug 18, 2014

Owner

You could factorize this test and similar ones (test_incremental_pca_against_pca_iris) using nose's yield checks.
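For reference, nose's yield-based checks look roughly like this (the function names here are made up for illustration):

```python
def check_batch_size_consistency(batch_size):
    # stand-in for a real per-case assertion
    assert batch_size > 0

def test_incremental_pca_batch_sizes():
    # nose runs each yielded (callable, arg) pair as a separate test
    # case, so one failing batch size does not hide the others.
    for batch_size in [10, 50, 100]:
        yield check_batch_size_consistency, batch_size
```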

@kastnerkyle

kastnerkyle Sep 5, 2014

Owner

Is there a benefit to doing this? I have never seen or used yield check before

@GaelVaroquaux

GaelVaroquaux Sep 6, 2014

Owner

The benefit is that if one assert breaks, the others are run.

I am +0 on this functionality. I find that it brings benefits, but also complexity.

@kastnerkyle

kastnerkyle Sep 9, 2014

Owner

I would prefer not to add that complexity here, but it is good to know that exists for the future.

+ updated_mean : array, shape (n_features,)
+
+ updated_variance : array, shape (n_features,)
+
@arjoly

arjoly Aug 18, 2014

Owner

It might be nice to also return the update_sample_count.

@ogrisel

ogrisel Sep 9, 2014

Owner

As per the discussion on private utils on the mailing list, I think we should make those new helper functions explicitly private with a leading _.

@kastnerkyle

kastnerkyle Sep 10, 2014

Owner

Ah yeah I will do that

Owner

arjoly commented Aug 18, 2014

When the last comments are addressed, this looks good to merge.

Owner

kastnerkyle commented Sep 5, 2014

I addressed the comments (except for one we are discussing) and think it is pretty close. Once it looks good I will rebase again then try for what's new changes and a merge!

Owner

GaelVaroquaux commented Sep 6, 2014

Have you added a mention of IncrementalPCA in doc/modules/scaling_strategies.rst? I think that would be important.

Owner

kastnerkyle commented Sep 6, 2014

I have not added anything in scaling strategies yet - will definitely do that!

Owner

kastnerkyle commented Sep 9, 2014

I added a reference to IncrementalPCA in scaling_strategies. I didn't see much else to add other than that.

sklearn/utils/tests/test_extmath.py
+ assert_almost_equal(final_count, A.shape[0])
+
+
+def test_ddof():
@ogrisel

ogrisel Sep 9, 2014

Owner

I would rename to test_incremental_variance_ddof.

@kastnerkyle

kastnerkyle Sep 10, 2014

Owner

Good call - test_ddof is pretty cryptic ;) or at least non-obvious

Coverage Status

Coverage increased (+0.04%) when pulling dfa80f6 on kastnerkyle:incremental_pca into 57f67d0 on scikit-learn:master.

sklearn/utils/tests/test_extmath.py
@@ -375,6 +402,61 @@ def test_fast_dot():
for x in [np.array([[d] * 10] * 2) for d in [np.inf, np.nan]]:
assert_raises(ValueError, _fast_dot, x, x.T)
+
+def test_update_formulas():
@ogrisel

ogrisel Sep 10, 2014

Owner

Same comment here: test_incremental_variance_update_formulas.

+
+ ipca = IncrementalPCA(batch_size=batch_size)
+ ipca.fit(X)
+ assert_almost_equal(ipca.explained_variance_ratio_.sum(), 1.0, 2)
@ogrisel

ogrisel Sep 10, 2014

Owner

Instead of comparing to 1.0 I would compare to the pca.explained_variance_ratio_.sum(). The fact that 2 PCA components capture 99%+ of the variance of the iris dataset is not trivial.

@ogrisel

ogrisel Sep 10, 2014

Owner

Actually:

>>> PCA(2).fit(iris.data).explained_variance_ratio_.sum()
0.97763177502480336

There might be a bug here, no?

@kastnerkyle

kastnerkyle Sep 10, 2014

Owner

Just saw this... you might be right. Seems weird to me


@kastnerkyle

kastnerkyle Sep 10, 2014

Owner

Even with all 3 components PCA only captures 99%... I guess I need to look into the incremental formula again. Not sure what it could be though

@kastnerkyle

kastnerkyle Sep 10, 2014

Owner

Something is wrong... even one component is still returning 1.0000000004 for IncrementalPCA. Will have to dig in more

@kastnerkyle

kastnerkyle Sep 18, 2014

Owner

I am having trouble reproducing this now, which is very strange. It corresponds to PCA within .002 (at n_components=1) and seems to converge closer with more n_components, which agrees with the intuition. PCA drops components at the end, after a full-size SVD, while IncrementalPCA has to drop them at intermediate steps, so they should not match exactly.

Olivier, can you check again that you see the same behavior? I am worried that it is something tricky now!

@kastnerkyle

kastnerkyle Sep 18, 2014

Owner
n_components = 1
----
IPCA: 0.922259873747
PCA:  0.924616207174

n_components = 2
----
IPCA: 0.976749521476
PCA:  0.977631775025

n_components = 3
----
IPCA: 0.994549005678
PCA:  0.99481691455

n_components = 4
----
IPCA: 1.0
PCA:  1.0
@GaelVaroquaux

GaelVaroquaux Sep 19, 2014

Owner

Well, that looks great. If you cannot reproduce the problem, is anything else blocking the merge of this PR?

@kastnerkyle

kastnerkyle Sep 19, 2014

Owner

Don't think so - just need to finish addressing the last comments then push here. If everything else looks good, I will then rebase and make changes to what's new

@GaelVaroquaux

GaelVaroquaux Sep 19, 2014

Owner

Cool. Ping us when you are done. I cannot wait to merge this guy.

Coverage Status

Coverage increased (+0.04%) when pulling e0425e3 on kastnerkyle:incremental_pca into 57f67d0 on scikit-learn:master.

+ Yt = IncrementalPCA(n_components=2).fit(X).transform(Xt)
+ Yt /= np.sqrt((Yt ** 2).sum())
+
+ assert_almost_equal(np.abs(Yt[0][0]), 1., 1)
@ogrisel

ogrisel Sep 10, 2014

Owner

Could you explain why this is expected in an English sentence as an inline comment?

@kastnerkyle

kastnerkyle Sep 10, 2014

Owner

I snagged this test from PCA - it will take me a little while to work out the logic behind it.

@kastnerkyle

kastnerkyle Sep 18, 2014

Owner

I think I figured it out - the key is that Xt is the "componentized" version of X, i.e. X expressed in the component basis. So it is testing that X is componentized correctly and then reconstructed.

Coverage Status

Coverage increased (+0.03%) when pulling 6dd52a6 on kastnerkyle:incremental_pca into 57f67d0 on scikit-learn:master.

Coverage Status

Coverage increased (+0.03%) when pulling 8501fbb on kastnerkyle:incremental_pca into 57f67d0 on scikit-learn:master.

Owner

kastnerkyle commented Sep 19, 2014

I rebased and made 2 commits - one with all the IncrementalPCA stuff and one with the what's new changes. As long as all tests pass, I think it is good to go. I can also squash it into a single commit if that is better.

Coverage Status

Coverage increased (+0.04%) when pulling f4da189 on kastnerkyle:incremental_pca into d34e928 on scikit-learn:master.

Owner

kastnerkyle commented Sep 19, 2014

@GaelVaroquaux @ogrisel Tests passed - any other changes/mods? Once this goes in, I can start on the randomized SVD solver.

Owner

ogrisel commented Sep 19, 2014

+1 on my side.

@ogrisel ogrisel changed the title from [MRG] Incremental PCA to [MRG+1] Incremental PCA Sep 19, 2014

Owner

kastnerkyle commented Sep 22, 2014

What's new changed... I just rebased so it should be good to merge again.

Owner

ogrisel commented Sep 23, 2014

Please give a little more detail for IncrementalPCA in what's new, e.g. "an implementation of the PCA algorithm that supports out-of-core learning with a partial_fit method."

Owner

ogrisel commented Sep 23, 2014

@arjoly, @GaelVaroquaux any final review? Maybe @larsmans would be interested in having a look at this one as well.

Owner

kastnerkyle commented Sep 23, 2014

I updated the what's new - your words summed it up pretty well :)

Coverage Status

Coverage increased (+0.04%) when pulling 5b5dd79 on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

+replacement for principal component analysis (PCA) when the dataset to be
+decomposed is too large to fit in memory. IPCA builds a low-rank approximation
+for the input data using an amount of memory which is independent of the
+input data size.
@larsmans

larsmans Sep 23, 2014

Owner

Independent of the number of samples, I suppose, not the number of features? (In document processing, the number of features tends to grow as O(√n) in the number of samples, with a big constant.)

@kastnerkyle

kastnerkyle Sep 23, 2014

Owner

Right. The number of features has to stay constant across all batches, and the featurewise cost is essentially SVD-dependent. I am looking at a future PR to add the ability to use randomized SVD in the case where the desired n_components <<< n_features, which is pretty much exactly the document case IIRC! Basically a solver = 'random' parameter that changes the central SVD. @ogrisel wanted this and it is a really good idea :)
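The intermediate truncation being discussed is the heart of the algorithm. A numpy-only sketch of the Ross/Lim-style batch update this PR implements (function names and signatures are illustrative, not the PR's actual API):

```python
import numpy as np

def ipca_init(X_batch, n_components):
    """Seed the model from the first batch with a plain centered SVD."""
    mean = X_batch.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_batch - mean, full_matrices=False)
    return Vt[:n_components], S[:n_components], mean, X_batch.shape[0]

def ipca_update(components, sing_vals, mean, n_seen, X_batch, n_components):
    """One incremental batch update: stack the scaled old components, the
    centered new batch, and a mean-correction row, then re-truncate."""
    n_new = X_batch.shape[0]
    n_total = n_seen + n_new
    batch_mean = X_batch.mean(axis=0)
    new_mean = (n_seen * mean + n_new * batch_mean) / n_total
    # extra row accounting for the shift between the old and new means
    mean_corr = np.sqrt(n_seen * n_new / n_total) * (mean - batch_mean)
    stacked = np.vstack([sing_vals[:, None] * components,
                         X_batch - batch_mean,
                         mean_corr])
    _, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[:n_components], S[:n_components], new_mean, n_total
```

Memory stays on the order of batch_size × n_features, and the dropped tail of each intermediate SVD is exactly why IPCA and PCA agree only approximately in the numbers above.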

sklearn/decomposition/incremental_pca.py
+ self.whiten = whiten
+ self.copy = copy
+ self.batch_size = batch_size
+ self.components_ = None
@larsmans

larsmans Sep 23, 2014

Owner

This is setting a fit attribute in __init__.
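The scikit-learn convention being pointed at, as a minimal sketch (the class and method bodies are illustrative, not the PR's code): __init__ only stores constructor parameters verbatim, and fitted attributes (trailing underscore) appear only once fit/partial_fit has run:

```python
class Sketch:
    def __init__(self, n_components=None, batch_size=None, copy=True,
                 whiten=False):
        # store constructor parameters only; no self.components_ = None here
        self.n_components = n_components
        self.batch_size = batch_size
        self.copy = copy
        self.whiten = whiten

    def partial_fit(self, X):
        # detect the first call without a sentinel set in __init__
        if not hasattr(self, "components_"):
            pass  # initialize the model from this first batch
        # ... update and set self.components_, self.mean_, etc.
        return self
```

Checking with hasattr/getattr inside partial_fit keeps the estimator clonable and avoids the fit-attribute-in-__init__ issue flagged above.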

Owner

larsmans commented Sep 23, 2014

I'm late to the party and I don't know the algorithm, but this seems like a useful addition and (apart from my comments) a well-written, readable estimator class.

It looks like one could even do a sparse-matrix PCA with this: densify in batches, then feed those to the estimator (don't worry Gaël, I'm not suggesting to add that to the class :).

Owner

kastnerkyle commented Sep 23, 2014

I never really thought about densifying sparse data, but I like the idea... I will have to play with that. Maybe adding the SVD solver to the parameters will open up some room to experiment with this? Presumably randomized SVD would be much better for sparse matrices.

Coverage Status

Coverage increased (+0.04%) when pulling fd0c768 on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

Owner

larsmans commented Sep 23, 2014

Randomized SVD is great for sparse data, but after mean-centering the sparsity is destroyed. A different solver could do the mean-centering on the fly instead of as a preprocessing step to avoid densifying. That could even help in the dense case, as there's no longer a need to first compute an X - mu matrix.
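That on-the-fly centering can be sketched with an implicit operator (my own illustration using scipy's svds, not code from this PR): since (X - 1·muᵀ)v = Xv - (mu·v)·1, the centered matrix never has to be materialized:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import LinearOperator, svds

X = sparse_random(300, 40, density=0.05, random_state=0, format="csr")
mu = np.asarray(X.mean(axis=0)).ravel()
ones = np.ones(X.shape[0])

# implicit (X - 1 mu^T): centering happens inside the mat/vec products
centered = LinearOperator(
    X.shape,
    matvec=lambda v: X @ np.ravel(v) - (mu @ np.ravel(v)) * ones,
    rmatvec=lambda v: X.T @ np.ravel(v) - v.sum() * mu,
    dtype=np.float64,
)
U, s, Vt = svds(centered, k=3)  # top-3 PCA directions, sparsity preserved
```

The same trick works in the dense case too, avoiding the explicit X - mu copy mentioned above.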

Owner

ogrisel commented Sep 23, 2014

Indeed @larsmans we could do real PCA with sparse data by centering only small mini-batches on the fly.

I would vote to implement that in a separate PR though.

Owner

larsmans commented Sep 23, 2014

Oh, I'm just mentioning it, not requesting that it should be written now.

Owner

kastnerkyle commented Sep 23, 2014

I agree on the new PR side, but I definitely think it is a good idea! Just updated the double backticks.

Coverage Status

Coverage increased (+0.04%) when pulling 5f8271f on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

Owner

GaelVaroquaux commented Sep 23, 2014

It looks like one could even do a sparse-matrix PCA with this: densify
in batches, then feed those to the estimator (don't worry Gaël, I'm not
suggesting to add that to the class :).

Well, in a later PR, I would see value in that. In this PR, I am afraid
of delaying the merge of an already very useful feature.

Owner

larsmans commented Sep 23, 2014

Sure, sure. +1 for merge.

@larsmans larsmans changed the title from [MRG+1] Incremental PCA to [MRG+2] Incremental PCA Sep 23, 2014

ogrisel added a commit that referenced this pull request Sep 23, 2014

@ogrisel ogrisel merged commit 7426e6a into scikit-learn:master Sep 23, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
Owner

ogrisel commented Sep 23, 2014

Thanks @kastnerkyle!

Owner

arjoly commented Sep 23, 2014

Thanks @kastnerkyle !!! Congratulations on your high-quality code.

Owner

kastnerkyle commented Sep 23, 2014

Thanks everyone for all the review and improvements :) glad to see this merged

Owner

agramfort commented Sep 23, 2014

🍻

Contributor

eickenberg commented Sep 23, 2014

cool! thanks for the great work!


Owner

GaelVaroquaux commented Sep 23, 2014

🍻


IssamLaradji added a commit to IssamLaradji/scikit-learn that referenced this pull request Oct 13, 2014

@giorgiop giorgiop referenced this pull request Sep 23, 2015

Merged

[MRG+3] Collapsing PCA and RandomizedPCA #5299
