Add 2D2PCA (Two-directional two-dimensional PCA) #1503

Closed · wants to merge 8 commits · 3 participants

yedtoss commented Dec 31, 2012

I added Two-directional two-dimensional PCA in sklearn/decomposition/pca_2d.py.

Tests are added in sklearn/decomposition/tests/test_pca_2d.py,

and an example is in examples/applications/face_recognition_pca_2d.py.

Reference: Daoqiang Zhang, Zhi-Hua Zhou: Two-directional two-dimensional PCA for efficient face representation and recognition. Neurocomputing, Volume 69, Issues 1-3, December 2005, Pages 224-231.

yedtoss commented Jan 2, 2013

It seems that a new estimator has to handle sparse matrices in order to pass the Travis tests. I don't know why.

The PCA2D class expects a 3D array, and scipy.sparse only has 2D matrices. Is there a way to use a 3D sparse matrix in scikit-learn?

Otherwise, for the PCA2D class with sparse matrices, I would be obliged to assume that the size of the third dimension is 1. Under that assumption, PCA2D becomes similar to SparsePCA.

Owner
amueller commented Jan 2, 2013

It is ok for an algorithm not to support sparse matrices. The tests check whether the algorithm either supports sparse matrices or gives an informative error. I guess the test somehow doesn't like the way you raised the error, but I'd have to investigate.

More generally: I'm not familiar with the algorithm you implemented. Is it specific to images? We don't really have image-specific algorithms; those are more appropriate for scikit-image.

yedtoss commented Jan 2, 2013

OK, I understand now. If the algorithm does not support sparse matrices, it should raise a TypeError, and the error message should contain the word "sparse" (File "sklearn/tests/test_common.py", line 269). A minimal sketch of that convention is below.

No, the algorithm is not specific to images, although many applications will use them. It can be applied to any 2D or 1D data.
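
For illustration, something like the following would satisfy that convention (the helper name _check_dense is hypothetical, not code from this PR):

```python
import numpy as np
from scipy.sparse import issparse


def _check_dense(X):
    # Estimators that do not support sparse input are expected to raise a
    # TypeError whose message contains the word "sparse"; the common tests
    # look for exactly that.
    if issparse(X):
        raise TypeError("PCA2D does not support sparse input; "
                        "densify X with X.toarray() first.")
    return np.asarray(X)
```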

Owner
amueller commented Jan 2, 2013

Ok, thanks for your comment.
I am not entirely sure this algorithm is a good fit for scikit-learn, for the following two reasons:

  • I am not sure how well adopted and established the algorithm is. We usually only include algorithms that are widely used, or at least well-known and established.
  • We have not included any algorithms specific to 2D data yet. In fact, apart from the feature_extraction modules (and the hmm module), all code expects the data to be an array or matrix of shape (n_samples, n_features).

Maybe you can comment at least on the first point; I'd like to hear others' opinions on this.

Owner
mblondel commented Jan 2, 2013

What is the reference paper? A good criterion for judging whether a paper is a good candidate for addition to scikit-learn is the number of citations of the reference paper.

yedtoss commented Jan 2, 2013

OK.
Yes, 2D PCA is well-known.
-- The specific paper I followed has 212 citations (http://scholar.google.com/scholar?hl=fr&lr=&cites=15916815601316055194&um=1&ie=UTF-8&sa=X&ei=mX_kUOaFDsmGhQfO1IGADQ&ved=0CEkQzgIwAQ). There is also another paper with 1725 citations (http://scholar.google.com/scholar?hl=fr&lr=&cites=5341271442669375619&um=1&ie=UTF-8&sa=X&ei=mX_kUOaFDsmGhQfO1IGADQ&ved=0CGsQzgIwBA).

-- The 2D PCA algorithm is almost the same as PCA (the methodology, justification, etc. are the same), so anyone using PCA can use 2D PCA. In fact, as I have implemented it, 2D PCA can accept 2D data, in which case it outputs almost the same results as PCA.

-- In many cases (especially for images), in order to use PCA we reshape every sample into a 1D vector. For example, if we have 500 samples of dimensions 1000 x 1000, we reshape them into a 500 x 1 000 000 matrix before applying PCA. With standard PCA this is really slow: an SVD on a 1 000 000 x 1 000 000 matrix or on a 500 x 1 000 000 matrix, depending on the implementation. 2D PCA works directly on the 500 x 1000 x 1000 data array, so the SVD operates on 1000 x 1000 matrices and is more efficient than PCA.

In conclusion, I would say that 2D PCA is well known and as useful as (if not more useful than) PCA. The implementation supports 2D and 3D arrays; in the 2D case, 2D PCA = PCA. It should also be possible to use a faster SVD implementation for 2D PCA; for example, we could use randomized_svd and get a randomized 2D PCA. (See the sketch below.)
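
To make the comparison concrete, here is a minimal NumPy sketch of the (2D)2 PCA fit and transform described above. The function names and signatures are illustrative only, not the exact API of this PR:

```python
import numpy as np


def pca_2d_fit(X, n_row_components, n_column_components):
    # X has shape (n_samples, n_row, n_column).
    mean = X.mean(axis=0)
    Xc = X - mean
    # Row-direction covariance, shape (n_row, n_row): sum_i A_i A_i^T.
    row_cov = np.tensordot(Xc, Xc, axes=([0, 2], [0, 2]))
    # Column-direction covariance, shape (n_column, n_column): sum_i A_i^T A_i.
    col_cov = np.tensordot(Xc, Xc, axes=([0, 1], [0, 1]))
    # eigh returns eigenvalues in ascending order, so reverse the
    # eigenvector columns and keep the leading components.
    _, row_vecs = np.linalg.eigh(row_cov)
    _, col_vecs = np.linalg.eigh(col_cov)
    row_components = row_vecs[:, ::-1][:, :n_row_components]
    column_components = col_vecs[:, ::-1][:, :n_column_components]
    return mean, row_components, column_components


def pca_2d_transform(X, mean, row_components, column_components):
    # Project each sample A_i to Z^T A_i W, giving shape
    # (n_samples, n_row_components, n_column_components).
    return np.einsum('rk,irc,cl->ikl',
                     row_components, X - mean, column_components)
```

The eigendecompositions operate only on (n_row, n_row) and (n_column, n_column) matrices, which is exactly where the speed-up over PCA on the flattened (n_samples, n_row * n_column) matrix comes from.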

yedtoss commented Jan 2, 2013

Reference papers:
1- Daoqiang Zhang, Zhi-Hua Zhou: Two-directional two-dimensional PCA for efficient face representation and recognition. Neurocomputing, Volume 69, Issues 1-3, December 2005, Pages 224-231. http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/neucom05a.pdf (212 citations)

2- Jian Yang, David Zhang, Alejandro F. Frangi, Jing Y. Yang: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 1 (2004), pp. 131-137. http://repository.lib.polyu.edu.hk/jspui/bitstream/10397/190/1/137.pdf (1725 citations)

Owner
mblondel commented Jan 2, 2013

In terms of number of citations, this seems relevant for inclusion in scikit-learn. Can you add the two paper references to the docstring? Have a look at other files for an example.

@mblondel mblondel commented on an outdated diff Jan 2, 2013
sklearn/decomposition/pca_2d.py
+#
+# License: BSD Style
+
+import numpy as np
+from scipy import linalg
+from scipy.sparse import issparse
+from ..base import BaseEstimator, TransformerMixin
+from ..utils import as_float_array, assert_all_finite
+
+
+class PCA2D(BaseEstimator, TransformerMixin):
+ """ Two directionnal two dimensional principal component analysis
+ This technique is based on 2D matrices as opposed to standard PCA which
+ is based on 1D vectors. It considers simultaneously the row and column
+ directions.2D PCA is computed by examining the covariance of the data.
+ It can easily get higher accuracy than 1D PCA.
mblondel Jan 2, 2013 Owner

PCA is unsupervised so what you mean by accuracy is unclear...

@mblondel mblondel commented on an outdated diff Jan 2, 2013
sklearn/decomposition/pca_2d.py
+
+ `row_components_` : array, [n_row, n_row_components]
+ components with maximum variance along the row direction
+
+ `column_components_` : array, [n_column, n_column_components]
+ components with maximum variance along the column direction
+
+ See also
+ --------
+ PCA
+ ProbabilisticPCA
+ RandomizedPCA
+ KernelPCA
+ SparsePCA
+
+
mblondel Jan 2, 2013 Owner

Your code contains lots of needless blank lines. More generally, your code should stick to the PEP8 coding style guidelines. You can use the "pep8" command-line tool to help you find style mistakes.

@mblondel mblondel commented on an outdated diff Jan 2, 2013
sklearn/decomposition/pca_2d.py
+ value decomposition. It works on dense matrices. The time complexity is:
+ O(n_row**3 + n_column**3 + n_row**2 * n_samples + n_column**2 * n_samples)
+ where n_samples is the number of examples, and n_row and n_column are the
+ dimensions of the original 2D matrices. In practice this means it can be
+ used if n_row, n_column and n_samples are all less than 300. More
+ formally, it can be used as long as no term in the complexity is too
+ large (less than 10**10).
+
+ In most cases it is then significantly faster than standard 1D PCA,
+ which has a complexity of O(n_row**3 * n_column**3).
+
+
+ Parameters
+ ----------
+
+ n_row_components : int, None, or string
mblondel Jan 2, 2013 Owner

I like the fact that the constructor parameters mirror the ones in PCA. Good!

@mblondel mblondel commented on an outdated diff Jan 2, 2013
sklearn/decomposition/tests/test_pca_2d.py
+import numpy as np
+
+
+from numpy.testing import assert_array_almost_equal
+from numpy.testing import assert_array_equal
+
+from sklearn import datasets
+from sklearn.decomposition import PCA2D
+
+
+digits = datasets.load_digits()
+
+
+def test_correct_shapes():
+
+ rng = np.random.RandomState(0)
mblondel Jan 2, 2013 Owner

Lots of ugly blank lines in this test too.

Owner
mblondel commented Jan 2, 2013

Quickly browsing the code, I think this looks like a good candidate for inclusion. You will need to fix the style mistakes and write documentation for the user guide (what we call "narrative documentation"). Before investing time in the documentation, you may want to wait and see whether other people are also ok with the idea of adding PCA2D to scikit-learn.

Owner
mblondel commented Jan 2, 2013

@amueller Many matrix factorization problems have been extended to more general tensor factorizations. I think it could be nice to support some in scikit-learn.

Grid search and cross-validation should a priori be fine, since they work on the first axis. For pipelines, we will need transformers for flattening the 3D array to 2D and vice versa (see the sketch below).
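
A rough sketch of such a flattening transformer (the name Flatten3D is hypothetical, nothing that exists in scikit-learn):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Flatten3D(BaseEstimator, TransformerMixin):
    """Reshape (n_samples, n_row, n_column) data to (n_samples, n_features)
    so a 3D-aware step can be chained with standard 2D estimators."""

    def fit(self, X, y=None):
        self.sample_shape_ = X.shape[1:]  # remember the per-sample shape
        return self

    def transform(self, X):
        return X.reshape(X.shape[0], -1)

    def inverse_transform(self, X):
        return X.reshape((X.shape[0],) + self.sample_shape_)
```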

Owner
amueller commented Jan 2, 2013

I'm undecided on this but wouldn't oppose it.
It seems like a somewhat limited scope to me, and extending in this direction will add some overhead.
It should work with GridSearchCV, and pipelining should be fine. I am just a bit sceptical about giving up the (n_samples, n_features) representation as a standard.
So now it will be (n_samples, n_feature_1, n_feature_2, ..., n_feature_n)?

Btw @mblondel if you have some time I'd really appreciate your input on #1485 and #1491.

yedtoss commented Jan 3, 2013

@mblondel It seems that pep8/autopep8 does not report or correct the blank lines. I have looked at other files and corrected the blank lines accordingly.

@amueller I am not sure the scope is that limited, because in many applications we unroll 2D data into 1D vectors before applying PCA; PCA2D fixes that.

yedtoss commented Jan 3, 2013

I have a question about PCA/SparsePCA .... with implications for PCA2D.
Their transform method expects a matrix of shape (n_samples, n_features), but if only a 1D array of shape (n,) is passed, the method turns it into a matrix of shape (n_samples = n, 1). I think it would be better to treat it as a matrix of shape (1, n_features = n). What do you think?
The same question holds for PCA2D.

Owner
amueller commented Jan 3, 2013

@yedtoss this should probably be consistent across the whole of scikit-learn. I would imagine it is not at the moment.
Also, I'm not sure which version we should prefer. I remember some discussions about this some time ago.
We should definitely first find out which algorithms behave which way.

yedtoss commented Jan 3, 2013

@amueller Let's name v1 using shape (n_samples = n, 1) and v2 using shape (1, n_features = n). Then PCA, RandomizedPCA, ProbabilisticPCA, ProjectedGradientNMF, NMF, FactorAnalysis, SparseCoder, DictionaryLearning, MiniBatchDictionaryLearning, FastICA and LDA use v1.
I am not sure about KernelPCA and SparsePCA. PCA2D as written also uses v1. I prefer v2, though, since it makes more sense.

Owner
amueller commented Jan 3, 2013

Why do you think it makes more sense?
For PCA, I guess neither makes sense. For classifiers I would guess v1 makes more sense; it needs to be compatible with y though, so the behavior will probably be clear to the user.

yedtoss commented Jan 3, 2013

I disagree with you when you say that neither makes sense. I am talking about the transform method, not the fit method. Here is why I think v2 makes more sense for decomposition algorithms:

-- I have used n_samples of data to find the principal components of all my data.

-- Then I may want to reduce the dimensions of n samples where n = 1. In other words, I may want to transform a single sample. If one sample is a 1D array of shape (n_features,), then I can pass the transform method either a 1D array of shape (n_features,) or a 2D array of shape (1, n_features). So v2 makes more sense (see the illustration below).
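
To illustrate the two conventions on a concrete array:

```python
import numpy as np

x = np.arange(4)               # one sample with 4 features, shape (4,)
v1 = x.reshape(-1, 1)          # v1: (n_samples=4, n_features=1)
v2 = x.reshape(1, -1)          # v2: (n_samples=1, n_features=4)
print(np.atleast_2d(x).shape)  # (1, 4) -- NumPy's own promotion matches v2
```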

Owner
amueller commented Jan 4, 2013

Ok, so for the transform of PCA, v2 would be meaningful while v1 would not. Still, it should be consistent across all of sklearn.
If you have ndim=3 as in your PR, how would you handle that? If you have data of shape (n_samples, n_features_1, n_features_2) and you pass something of shape (n_features_1, n_features_2), will it be interpreted as (1, n_features_1, n_features_2)?

yedtoss commented Jan 4, 2013

Yes, PCA2D should interpret it in the transform method as (1, n_features_1, n_features_2). In the first commits it was done like that, but to be consistent with the other sklearn decompositions, I finally used (n_samples, n_features_1, 1).
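
In code, the first-commit interpretation amounts to something like this sketch (not the current behavior of the PR; the helper name is illustrative):

```python
import numpy as np


def _promote_single_sample(X):
    # Treat a single (n_row, n_column) matrix as a batch of one sample.
    if X.ndim == 2:
        X = X[np.newaxis, ...]  # (n_row, n_col) -> (1, n_row, n_col)
    return X
```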

yedtoss commented Jan 16, 2013

Can I now write the narrative documentation for 2DPCA?
@amueller What is your conclusion about the way a 1D array is interpreted in the transform method?
Can I open an issue to change the way 1D arrays are interpreted in the sklearn decomposition methods?

Owner
amueller commented Jan 16, 2013

I am still unsure about the inclusion of 2D2PCA; maybe we should discuss it on the mailing list.
For 1D arrays, you could open an issue for discussion. I am very much against changing the behavior only for decomposition models; it should be consistent for all estimators.

yedtoss commented Jan 16, 2013

Yes, I agree. It should be consistent across all estimators.
I will open an issue soon for the predict and transform methods of all estimators.

Owner
mblondel commented Jan 8, 2014
yedtoss commented Jan 9, 2014

OK.

Owner
amueller commented Jun 8, 2015

Closing as out of scope. Adding it as a gist / linking from the website etc. is still welcome.

@amueller amueller closed this Jun 8, 2015