
Conversation

kevinhughes27

WIP - I expect some more work will be required before this can be merged, but I wanted to make a PR now to get some feedback.

Kevin Hughes added 2 commits April 16, 2013 13:18
Candid Covariance-Free Incremental Principal Component Analysis (CCIPCA)
------------------------------------------------------------------------

CCIPCA, like PCA, is used to decompose a multivariate dataset into a set of successive
Member

Small thing - would you mind just switching up the wording in this paragraph a bit, just so it's not so similar to IPCA's opening paragraph? Since this will go in the documentation, it may look a bit copy/pasty to users. Thanks :)

Author

Okay, will do. I figured there would be more opinions on the documentation, so I didn't make them too different for the initial commit.

@jaquesgrobler
Member

Went through it very quickly. Very nice work as far as I could tell so far.
Thanks for your work on this

@GaelVaroquaux
Member

Hi @pickle27,

thanks a lot for your pull request. I am wondering: IPCA and CCIPCA are
techniques that are 10 to 15 years old and only got a couple hundred
citations. I don't know these techniques, so I cannot really comment on
their usefulness on practical data, but given the number of citations I
wonder: do they really help in solving difficult data processing problems,
or are they mainly interesting for academic reasons?

The reason that I am asking this is that we still have some fairly core
improvements to do to scikit-learn, and in terms of growing the features I
would like to focus on the high-impact methods.

@kevinhughes27
Author

The main advantage of the incremental methods is being able to update the learned model if more data is acquired. I haven't tested this implementation, but because it processes one sample at a time it should be able to handle larger data sets. The incremental methods are also useful when developing an online system that is constantly receiving data, for example video in a computer vision problem.
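For reference, the one-sample-at-a-time update performed by the algorithm in this PR's title (CCIPCA, Weng et al.) can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: it assumes mean-centered input, omits the amnesic parameter, and the function name `ccipca` is hypothetical.

```python
import numpy as np

def ccipca(X, k):
    """Simplified sketch of Candid Covariance-free Incremental PCA:
    one pass over the data, one sample at a time, no amnesic parameter,
    input assumed already mean-centered."""
    V = np.zeros((k, X.shape[1]))      # unnormalized eigenvector estimates
    for n, x in enumerate(X):
        u = x.copy()
        for i in range(min(k, n + 1)):
            if i == n:
                V[i] = u               # initialize component i with the residual
            else:
                vn = V[i] / np.linalg.norm(V[i])
                # blend the old estimate with the sample's projection onto it
                V[i] = (n / (n + 1)) * V[i] + (1 / (n + 1)) * (u @ vn) * u
                vn = V[i] / np.linalg.norm(V[i])
                u = u - (u @ vn) * vn  # deflate before estimating component i+1
    return V / np.linalg.norm(V, axis=1, keepdims=True)
```

Because each sample is consumed and then discarded, memory stays O(k * d) no matter how many samples stream in, which is what makes this kind of update attractive for online settings.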

@GaelVaroquaux
Member

The main advantage of the incremental methods is being able to update the
learned model if more data is acquired. I haven't tested this implementation
but because it processes one sample at a time it should be able to handle
larger data sets. Also the incremental methods are useful when developing an
online system that is constantly receiving video for example a computer vision
problem

Great! Online PCA is certainly a very useful thing, and I for one would
be most excited to have a good implementation in scikit-learn. The main
purpose of the online models in scikit-learn is to fit large datasets, so
we worry a lot about the CPU and memory usage of these.

In terms of fast online algorithms for PCA, there is a nice table in
[Rehurek2010], and IPCA does not appear in it. I am wondering whether there
has been progress since then. I don't know of any high-performance package
that uses IPCA for online PCA; such packages rely on the algorithms listed
in [Rehurek2010]. What made you choose IPCA in particular? A few of the
references in [Rehurek2010] have had more of a citation hit lately than
the IPCA paper.

Other interesting resources are
https://sites.google.com/site/recheliunew/researchblog/softwareincrementalsvdpackage
http://www.math.fsu.edu/~cbaker/IncPACK/
in particular, the link to the SIAM presentation gives a good overview.

I am sorry if I appear picky when asking all these questions. I am very
excited about the feature. I am just trying to find the best algorithm to
include in scikit-learn. One thing to keep in mind is that there will be a
significant effort in making the implementation fast and robust, so making
the right choice does matter.

A few minor remarks on the code (I just had a quick look): the method
'inc_fit' should be called 'partial_fit' to match the scikit-learn API. And
I'd prefer the object and the file to be named 'IncrementalPCA' and
'incremental_pca': it is easier to guess what it does. Also, there is
some redundancy in the different PCA implementations (the transform and
inverse_transform methods are the same), so we need to create a base
class to factor these out and use inheritance.
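The base-class factoring described here could look something like the sketch below. All names besides `partial_fit`, `transform`, and `inverse_transform` are illustrative, and the `partial_fit` body is a placeholder (a batch SVD per call, not a real incremental update); it only shows where the shared methods would live.

```python
import numpy as np

class _BasePCA:
    """Hypothetical shared base class: transform / inverse_transform
    are written once and inherited by every PCA variant."""

    def transform(self, X):
        # project centered data onto the learned components
        return (X - self.mean_) @ self.components_.T

    def inverse_transform(self, X):
        # map from component space back to the original feature space
        return X @ self.components_ + self.mean_


class IncrementalPCA(_BasePCA):
    """Sketch of the partial_fit API shape only; the update rule below
    is a stand-in, not an actual incremental algorithm."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.mean_ = None
        self.components_ = None

    def partial_fit(self, X):
        # placeholder: refit mean and components from this batch alone
        self.mean_ = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[:self.n_components]
        return self
```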

Thanks a lot for your PR: this is an incredibly useful feature!

[Rehurek2010] http://arxiv.org/pdf/1102.5597.pdf

@kevinhughes27
Author

Made the small changes you suggested. I noticed the duplication of the transform method etc as well. What do you propose for the base class structure?

I came across IPCA and CCIPCA for my thesis research dealing with subspace learning applied to computer vision. Most of the vision papers were doing something very similar to the IPCA implementation. There are probably a few other references that would be suitable to add to IPCA, as that technique is quite common; for example, I took a quick read through Manjunath1995 from http://www.math.fsu.edu/~cbaker/IncPACK/ and it looks quite similar. Perhaps IPCA is just lacking a clear reference choice.

I read through Rehurek2010 and it looks interesting. I'm happy to keep working on this and find which algorithm we want for scikit-learn; it will also serve to help my thesis :) I am also going to be submitting a proposal for GSoC. I had originally planned to apply to work on an online NMF algorithm (as per the ideas page), but perhaps I could propose to continue working on this if that sounds like a good idea.

@kevinhughes27
Author

Definitely put this on hold: I've found a bug in plain incremental PCA and it doesn't scale at all. CCIPCA is fine, though.

@kevinhughes27
Author

The bug is fixed, but IPCA is still very slow on higher-dimensional data (images). Let's talk about which incremental method we want!

@kevinhughes27
Author

I have another variant of IPCA that does scale, I won't bother formatting it for scikit-learn until we finalize which algorithm we want

@AlexandreAbraham
Contributor

Hi Kevin! I am very interested in this topic, as I am going to need such a method soon. I will form my own opinion thanks to all the references here, but I would like to know if you have made some progress on this question; please let me know.

@GaelVaroquaux
Member

I have another variant of IPCA that does scale, I won't bother formatting it
for scikit-learn until we finalize which algorithm we want

Would you mind giving the reference in this pull request? I am very
interested, just lacking time!

@kevinhughes27
Author

Hey, I am actually formatting the reference and the code right now (not with the intention of merging, but just to help the discussion).

@GaelVaroquaux
Member

hey I am actually formatting the reference and the code right now (not
with the intention of merging but just to help the discussion)

That's really helpful. @AlexandreAbraham is a PhD student working with me
and will have a look too, to gain insight.

G

@kevinhughes27
Author

I added my implementation for the incremental technique described in:

Skocaj, Danijel, and Ales Leonardis. "Weighted and robust incremental method for subspace learning." Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003.

It should produce the same subspace as Hall's method but is more efficient with higher-dimensional data like images, because it uses the projections to update the basis vectors rather than the original data. Google Scholar says the paper has been cited 135 times, and this paper is very practical for my research, but as per our discussion before, I don't really know if this is the canonical incremental method that we should focus on for scikit-learn.

Anyways here it is and keep me in the loop about what direction you're heading and I'll be happy to help out! IPCA is a big part of my thesis so I am definitely invested in exploring alternatives.
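To illustrate the efficiency point (updating the basis from the small k-dimensional projections instead of re-touching the d-dimensional data), here is a rough Hall-style sketch. It is not the Skocaj-Leonardis code from the commit: it assumes zero-mean data, keeps all past coefficients, and the function name is hypothetical.

```python
import numpy as np

def ipca_update(U, A, x, k, tol=1e-8):
    """One Hall-style incremental PCA step on zero-mean data.
    U: (d, m) orthonormal basis (or None before the first sample);
    A: (m, n) coefficients of the n samples seen so far.
    Returns the updated (U, A) with at most k components."""
    if U is None:
        norm = np.linalg.norm(x)
        return x[:, None] / norm, np.array([[norm]])
    a = U.T @ x                        # k-dim projection of the new sample
    r = x - U @ a                      # residual orthogonal to the basis
    rn = np.linalg.norm(r)
    if rn > tol:                       # grow the basis by the residual direction
        U = np.hstack([U, (r / rn)[:, None]])
        A = np.vstack([A, np.zeros((1, A.shape[1]))])
        a = np.append(a, rn)
    A = np.hstack([A, a[:, None]])
    # re-diagonalize in the small coefficient space; the (d x n) data
    # matrix is never revisited, only the (m x n) coefficients
    W, _, _ = np.linalg.svd(A, full_matrices=False)
    m = min(k, W.shape[1])
    return U @ W[:, :m], W[:, :m].T @ A
```

The SVD here is over an (m x n) coefficient matrix with m at most k + 1, which is the source of the speedup on high-dimensional data such as images, where d is large.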

@patricksnape

I am also quite interested in this topic, and would note that one of the papers that I frequently use cites:

Ross, David A., et al. "Incremental learning for robust visual tracking." 
International Journal of Computer Vision 77.1-3 (2008): 125-141.

for incremental learning. This paper has 544 citations and is known to be robust.

@kevinhughes27
Author

Cool, thanks for sharing. I will give that paper a read when I get a chance.

@kevinhughes27
Author

Looks interesting, and the authors did post MATLAB source code. They compared their method to batch PCA and Hall's IPCA. It should be quite easy to port their work to Python if that's what we want. I am short on time right now, so I don't want to do this until I get some confirmation that this is the technique that scikit-learn wants.

@mblondel
Member

mblondel commented Jan 8, 2014

@pickle27 Since this PR has stalled, I think it would be a good idea to extract the source into a github repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

@kevinhughes27
Author

@mblondel sure I like that idea. I'll work on doing this in the next couple of days.

@GaelVaroquaux
Member

As a side note, once we have more insight on the different choices (which
the gist and its usage might provide), I'd really like to see an
incremental PCA in scikit-learn. It is so important for people working
with "big data".

@mblondel
Member

mblondel commented Jan 9, 2014

I think that for self-contained PRs which do not require modifying existing code in scikit-learn, it would be useful if people could first create a GitHub repo and then discuss inclusion into scikit-learn on the ML. This way i) we can test the repo beforehand without putting too much effort into review, ii) we can reject early if needed, iii) code is not lost if the PR is stalled, and iv) we promote our fit / predict / transform API.

@GaelVaroquaux
Member

I think that for self-contained PRs which do not require modifying existing
code in scikit-learn, it would be useful if people could first create a github
repo then discuss inclusion into scikit-learn on the ML.

A gist, or a separate small repo (commenting on a repo is richer than in
a gist).

Indeed, I agree, and I think that the contributing notes could be
modified to suggest this, and say that these new features should be
discussed using an enhancement proposal/RFC discussion.

@kevinhughes27
Author

I am actually going to make this into a small repo called pyIPCA. How do you suppose we can monitor the usage of each IPCA algorithm from the repo?

@GaelVaroquaux
Member

I am actually going to make this into a small repo called pyIPCA.

Sounds great.

How do you
suppose we can monitor the usage of each IPCA algorithm from the repo?

Not that I know of.

@kevinhughes27
Author

Also, can we get a link to that GitHub page on the main scikit-learn site if there isn't one already? Something like: "Didn't find what you're looking for? Check the waiting list!" Then explain a bit about the process for how code gets into sklearn and why we want to be careful.

@mblondel
Member

mblondel commented Jan 9, 2014

There is this wiki page:
https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

Not yet advertised on the website though.

@kevinhughes27
Author

done: https://github.com/pickle27/pyIPCA
and added to the wiki page.

I do think that a link from the scikit-learn site to that wiki page would be good.

@mblondel
Member

+1
