
Conversation

kevinhughes27

WIP - I expect some more work will be required before this can be merged, but I wanted to make a PR now to get some feedback.

Kevin Hughes added 2 commits April 16, 2013 13:18
Candid Covariance-Free Incremental Principal Component Analysis (CCIPCA)
------------------------------------------------------------------------

CCIPCA, like PCA, is used to decompose a multivariate dataset into a set of successive
Member

Small thing - would you mind just switching up the wording in this paragraph a bit, just so it's not so similar to IPCA's opening paragraph? Since this will go in the documentation, it may look a bit copy/pasty to users. Thanks :)

Author

Okay, will do. I figured there would be more opinions on the documentation, so I didn't make them too different for the initial commit.

@jaquesgrobler
Member

Went through it very quickly. Very nice work as far as I could tell so far.
Thanks for your work on this

@GaelVaroquaux
Member

Hi @pickle27,

thanks a lot for your pull request. I am wondering: IPCA and CCIPCA are
techniques that are 10 to 15 years old and only got a couple hundred
citations. I don't know these techniques, so I cannot really comment on
their usefulness on practical data, but given the number of citations I
wonder: do they really help in solving difficult data processing problems,
or are they mainly interesting for academic reasons?

The reason that I am asking this is that we still have some fairly core
improvements to do to scikit-learn, and in terms of growing the features I
would like to focus on the high-impact methods.

@kevinhughes27
Author

The main advantage of the incremental methods is being able to update the learned model if more data is acquired. I haven't tested this implementation, but because it processes one sample at a time it should be able to handle larger data sets. The incremental methods are also useful when developing an online system that is constantly receiving data, for example video in a computer vision problem.
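For reference, the one-sample-at-a-time update performed by the algorithm in this PR's title (CCIPCA, Weng et al.) can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: it assumes mean-centered input, omits the amnesic parameter, and the function name `ccipca` is hypothetical.

```python
import numpy as np

def ccipca(X, k):
    """Simplified sketch of Candid Covariance-free Incremental PCA:
    one pass over the data, one sample at a time, no amnesic parameter,
    input assumed already mean-centered."""
    V = np.zeros((k, X.shape[1]))      # unnormalized eigenvector estimates
    for n, x in enumerate(X):
        u = x.copy()
        for i in range(min(k, n + 1)):
            if i == n:
                V[i] = u               # initialize component i with the residual
            else:
                vn = V[i] / np.linalg.norm(V[i])
                # blend the old estimate with the sample's projection onto it
                V[i] = (n / (n + 1)) * V[i] + (1 / (n + 1)) * (u @ vn) * u
                vn = V[i] / np.linalg.norm(V[i])
                u = u - (u @ vn) * vn  # deflate before estimating component i+1
    return V / np.linalg.norm(V, axis=1, keepdims=True)
```

Because each sample is consumed and then discarded, memory stays O(k * d) no matter how many samples stream in, which is what makes this kind of update attractive for online settings.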

@GaelVaroquaux
Member

The main advantage of the incremental methods is being able to update the
learned model if more data is acquired. I haven't tested this implementation
but because it processes one sample at a time it should be able to handle
larger data sets. Also the incremental methods are useful when developing an
online system that is constantly receiving video for example a computer vision
problem

Great! Online PCA is certainly a very useful thing, and I for one would
be most excited to have a good implementation in scikit-learn. The main
purpose of the online models in scikit-learn is to fit large datasets, so
we worry a lot about the CPU and memory usage of these.

In terms of fast online algorithms for PCA, there is a nice table in
[Rehurek2010], and IPCA does not appear in it. I am wondering whether there
has been progress since then. I don't know of any high-performance package
that uses IPCA for online PCA; such packages rely on the algorithms listed
in [Rehurek2010]. What made you choose IPCA in particular? A few of the
references in [Rehurek2010] have had more of a citation hit lately than
the IPCA paper.

Other interesting resources are
https://sites.google.com/site/recheliunew/researchblog/softwareincrementalsvdpackage
http://www.math.fsu.edu/~cbaker/IncPACK/
in particular, the link to the SIAM presentation gives a good overview.

I am sorry if I appear picky when asking all these questions. I am very
excited about the feature. I am just trying to find the best algorithm to
include in scikit-learn. One thing to keep in mind is that there will be a
significant effort in making the implementation fast and robust, so making
the right choice does matter.

A few minor remarks on the code (I just had a quick look): the method
'inc_fit' should be called 'partial_fit' to match the scikit-learn API. And
I'd prefer the object and the file to be named 'IncrementalPCA' and
'incremental_pca': it is easier to guess what it does. Also, there is
some redundancy in the different PCA implementations (the transform and
inverse_transform methods are the same), so we need to create a base
class to factor these out and use inheritance.
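The base-class factoring described here could look something like the sketch below. All names besides `partial_fit`, `transform`, and `inverse_transform` are illustrative, and the `partial_fit` body is a placeholder (a batch SVD per call, not a real incremental update); it only shows where the shared methods would live.

```python
import numpy as np

class _BasePCA:
    """Hypothetical shared base class: transform / inverse_transform
    are written once and inherited by every PCA variant."""

    def transform(self, X):
        # project centered data onto the learned components
        return (X - self.mean_) @ self.components_.T

    def inverse_transform(self, X):
        # map from component space back to the original feature space
        return X @ self.components_ + self.mean_


class IncrementalPCA(_BasePCA):
    """Sketch of the partial_fit API shape only; the update rule below
    is a stand-in, not an actual incremental algorithm."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.mean_ = None
        self.components_ = None

    def partial_fit(self, X):
        # placeholder: refit mean and components from this batch alone
        self.mean_ = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[:self.n_components]
        return self
```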

Thanks a lot for your PR: this is an incredibly useful feature!

[Rehurek2010] http://arxiv.org/pdf/1102.5597.pdf

@kevinhughes27
Author

Made the small changes you suggested. I noticed the duplication of the transform method etc as well. What do you propose for the base class structure?

I came across IPCA and CCIPCA for my thesis research dealing with subspace learning applied to computer vision. Most of the vision papers were doing something very similar to the IPCA implementation. There are probably a few other references that would be suitable to add to IPCA, as that technique is quite common; for example, I took a quick read through Manjunath1995 from http://www.math.fsu.edu/~cbaker/IncPACK/ and it looks quite similar. Perhaps IPCA is just lacking a clear reference choice.

I read through Rehurek2010 and it looks interesting. I'm happy to keep working on this and find which algorithm we want for scikit-learn; it will also serve to help my thesis :) I am also going to be submitting a proposal for GSoC. I had originally planned to apply to work on an online NMF algorithm (as per the ideas page), but perhaps I could propose to continue working on this if that sounds like a good idea.

@kevinhughes27
Author

Definitely put this on hold: I've found a bug in plain incremental PCA and it doesn't scale at all. CCIPCA is fine, though.

@kevinhughes27
Author

The bug is fixed, but IPCA is still very slow on higher-dimensional data (images). Let's talk about which incremental method we want!

@kevinhughes27
Author

I have another variant of IPCA that does scale, I won't bother formatting it for scikit-learn until we finalize which algorithm we want

@AlexandreAbraham
Contributor

Hi Kevin! I am very interested in this topic, as I am going to need such a method soon. I will form my own opinion thanks to all the references here, but I would like to know if you have made some progress on this question; please let me know.

@GaelVaroquaux
Member

I have another variant of IPCA that does scale, I won't bother formatting it
for scikit-learn until we finalize which algorithm we want

Would you mind giving the reference in this pull request? I am very
interested, just lacking time!

@kevinhughes27
Author

Hey, I am actually formatting the reference and the code right now (not with the intention of merging, but just to help the discussion).

@GaelVaroquaux
Member

hey I am actually formatting the reference and the code right now (not
with the intention of merging but just to help the discussion)

That's really helpful. @AlexandreAbraham is a PhD student working with me
and will have a look too, to gain insight.

G

@kevinhughes27
Author

I added my implementation for the incremental technique described in:

Skocaj, Danijel, and Ales Leonardis. "Weighted and robust incremental method for subspace learning." Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003.

It should produce the same subspace as Hall's method but is more efficient with higher-dimensional data like images, because it uses the projections to update the basis vectors rather than the original data. Google Scholar says the paper has been cited 135 times, and this paper is very practical for my research, but as per our discussion before, I don't really know if this is the canonical incremental method that we should focus on for scikit-learn.

Anyways here it is and keep me in the loop about what direction you're heading and I'll be happy to help out! IPCA is a big part of my thesis so I am definitely invested in exploring alternatives.
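To illustrate the efficiency point (updating the basis from the small k-dimensional projections instead of re-touching the d-dimensional data), here is a rough Hall-style sketch. It is not the Skocaj-Leonardis code from the commit: it assumes zero-mean data, keeps all past coefficients, and the function name is hypothetical.

```python
import numpy as np

def ipca_update(U, A, x, k, tol=1e-8):
    """One Hall-style incremental PCA step on zero-mean data.
    U: (d, m) orthonormal basis (or None before the first sample);
    A: (m, n) coefficients of the n samples seen so far.
    Returns the updated (U, A) with at most k components."""
    if U is None:
        norm = np.linalg.norm(x)
        return x[:, None] / norm, np.array([[norm]])
    a = U.T @ x                        # k-dim projection of the new sample
    r = x - U @ a                      # residual orthogonal to the basis
    rn = np.linalg.norm(r)
    if rn > tol:                       # grow the basis by the residual direction
        U = np.hstack([U, (r / rn)[:, None]])
        A = np.vstack([A, np.zeros((1, A.shape[1]))])
        a = np.append(a, rn)
    A = np.hstack([A, a[:, None]])
    # re-diagonalize in the small coefficient space; the (d x n) data
    # matrix is never revisited, only the (m x n) coefficients
    W, _, _ = np.linalg.svd(A, full_matrices=False)
    m = min(k, W.shape[1])
    return U @ W[:, :m], W[:, :m].T @ A
```

The SVD here is over an (m x n) coefficient matrix with m at most k + 1, which is the source of the speedup on high-dimensional data such as images, where d is large.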

@patricksnape

I am also quite interested in this topic, and would note that one of the papers that I frequently use cites:

Ross, David A., et al. "Incremental learning for robust visual tracking." 
International Journal of Computer Vision 77.1-3 (2008): 125-141.

for incremental learning. This paper has 544 citations and is known to be robust.

@kevinhughes27
Author

Cool, thanks for sharing. I will give that paper a read when I get a chance.

@kevinhughes27
Author

Looks interesting, and the authors did post MATLAB source code. They compared their method to batch PCA and Hall's IPCA. It should be quite easy to port their work to Python if that's what we want. I am short on time right now, so I don't want to do this until I get some confirmation that this is the technique that scikit-learn wants.

@mblondel
Member

mblondel commented Jan 8, 2014

@pickle27 Since this PR has stalled, I think it would be a good idea to extract the source into a github repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

@kevinhughes27
Author

@mblondel sure I like that idea. I'll work on doing this in the next couple of days.

@GaelVaroquaux
Member

As a side note, once we have more insight on the different choices (which
the gist and its usage might provide), I'd really like to see an
incremental PCA in scikit-learn. It is so important for people working
with "big data".

@mblondel
Member

mblondel commented Jan 9, 2014

I think that for self-contained PRs which do not require modifying existing code in scikit-learn, it would be useful if people could first create a GitHub repo and then discuss inclusion into scikit-learn on the ML. This way i) we can test the repo beforehand without putting too much effort into review, ii) we can reject early if needed, iii) code is not lost if the PR is stalled, and iv) we promote our fit / predict / transform API.

@GaelVaroquaux
Member

I think that for self-contained PRs which do not require modifying existing
code in scikit-learn, it would be useful if people could first create a github
repo then discuss inclusion into scikit-learn on the ML.

A gist, or a separate small repo (commenting on a repo is richer than in
a gist).

Indeed, I agree, and I think that the contributing notes could be
modified to suggest this, and say that these new features should be
discussed using an enhancement proposal/RFC discussion.

@kevinhughes27
Author

I am actually going to make this into a small repo called pyIPCA. How do you suppose we can monitor the usage of each IPCA algorithm from the repo?

@GaelVaroquaux
Member

I am actually going to make this into a small repo called pyIPCA.

Sounds great.

How do you
suppose we can monitor the usage of each IPCA algorithm from the repo?

Not that I know of.

@kevinhughes27
Author

Also, can we get a link to that GitHub page on the main scikit-learn site if there isn't one already? Something like: "Didn't find what you're looking for? Check the waiting list!" Then explain a bit about the process for how code gets into sklearn and why we want to be careful.

@mblondel
Member

mblondel commented Jan 9, 2014

There is this wiki page:
https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

Not yet advertised on the website though.

@kevinhughes27
Author

done: https://github.com/pickle27/pyIPCA
and added to the wiki page.

I do think that a link from the scikit-learn site to that wiki page would be good.

@mblondel
Member

+1
