[MRG] Add factor analysis #3294

yl565 · 2016-11-26T18:25:48Z

This PR is for adding factor analysis. Already implemented Principal Axes Factor Analysis and the results are identical to R psych library fa when rotation="none" and fm="pa".

To-do:

yl565 · 2016-11-26T18:37:58Z

Here is the R code to produce results of test_example_compare_to_R_output() :

library(psych)
Basal = c(2.068,	2.068,	2.09,	2.097,	2.117,	2.14,	2.045,	2.076,	2.09,	2.111,	2.093,	2.1,	2.104)
Occ = c(2.07,	2.074,	2.09,	2.093,	2.125,	2.146,	2.054,	2.088,	2.093,	2.114,	2.098,	2.106,	2.101)
Max = c(1.58,	1.602,	1.613,	1.613,	1.663,	1.681,	1.58,	1.602,	1.643,	1.643,	1.653,	1.623,	1.653)
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
Y <- cbind(Basal, Occ, Max, id)
a <- fa(Y, nfactors=2, fm="pa", rotate="none", SMC=TRUE, min.err=1e-10)
a$loadings
a$communality

josef-pkt · 2016-11-26T19:04:03Z

(I haven't looked yet)

I remember some discussions in mailing lists (not on statsmodels mailing list) about factor rotation.
There might be some reusable code out there that is license compatible, if needed.
eg. a brief search for "python varimax" shows
https://github.com/mvds314/factor_rotation
which I've never seen before. It's license compatible, looks like BSD-3, same as ours.

yl565 · 2016-11-26T19:08:57Z

@josef-pkt Nice! Glad to know I don't have to do it from scratch. How do I use it? I mean can I copy and paste and acknowledge in file

josef-pkt · 2016-11-26T19:50:04Z

If it's license compatible (especially BSD-2, BSD-3 or MIT), then we can just copy the code.

For good manners and traceability of the origin:
If it's just part of a module like one or a few functions, then I just copy them. If it's almost another package, e.g. several modules, then I try to have a clean commit of the unchanged original code where we can specify the author in the commit message. And make adjustments to the code in follow-up commits.
For one large commit, where I essentially copied another package into statsmodels, I also contacted the author to tell him that I'm doing it.

yl565 · 2016-11-27T00:22:12Z

Factor rotation added using the factor_rotation package by @mvds314

The following rotation method produces the same results as R fa:
None (as 'none' in R), "quartimax", "oblimin"

The following method produces somewhat different results:
"varimax", "equamax", "promax", "biquartimin"

"varimax" produces the same results as the code here

yl565 · 2016-11-27T00:25:26Z

@josef-pkt Can I translate some R code into python?

josef-pkt · 2016-11-27T00:52:28Z

Almost all R code is GPL licensed which is NOT compatible with our license, BSD-3, and we cannot translate that code.
We can only translate R code if it has a BSD- or MIT-type license (very few packages do) or if we have the explicit permission of the author.

josef-pkt · 2016-11-27T01:04:05Z

In the past (after my transition from matlab to python), I was looking sometimes on the matlab fileexchange for some details, most of it is BSD licensed but not all.
e.g.
https://www.mathworks.com/matlabcentral/fileexchange/?term=tag%3A%22factor+analysis%22
Antonio Trujillo-Ortiz has a nice collection of statistical tests that I looked at (more for inspiration or checking some detail than actual translation)
Julia has a multivariate package but it doesn't have factor models yet. Most Julia packages are MIT licensed and license compatible for us.

yl565 · 2016-11-27T04:21:36Z

@josef-pkt I think this is ready for you to review. After it is merged, I'm planning to implement other factor analysis methods such as maximum likelihood.

coveralls · 2016-11-27T04:24:13Z

Coverage decreased (-0.3%) to 87.644% when pulling 8f2895c on yl565:factoranalysis into 5797248 on statsmodels:master.

coveralls · 2016-11-27T04:40:14Z

Coverage decreased (-0.3%) to 87.644% when pulling 0d127b5 on yl565:factoranalysis into 5797248 on statsmodels:master.

coveralls · 2016-11-27T04:59:56Z

Coverage decreased (-0.3%) to 87.636% when pulling 0d127b5 on yl565:factoranalysis into 5797248 on statsmodels:master.

mvds314 · 2016-11-27T15:29:38Z

Good to see that the code on github is of benefit!

The following method produces somewhat different results:
"varimax", "equamax", "promax", "biquartimin"

Both the R package and the Python package use the algorithms described here:
http://www.stat.ucla.edu/research/gpa/
The Python package contains unittests that test whether the code produces the same output as the Matlab/Octave code on the site. In that sense, the code already has been double checked for errors. If there are inconsistencies with the R code you are referring to that need to be resolved, I might be able to help.

The Python package is (with explicit permission of the original authors) published under a BSD license on github.

Just to mention, the package on github might not work under Python 3 because of the way I imported things in the __init__.py file. (This was the first time I put code in an actual package :)

josef-pkt · 2016-11-27T16:10:06Z

statsmodels/multivariate/factor_rotation/_analytic_rotation.py

+    #return
+    return A.dot(T), T
+
+class unittests(unittest.TestCase):


unit tests should go into a statsmodels/multivariate/factor_rotation/tests directory so they are picked up by nose for automatic testing

josef-pkt · 2016-11-27T16:11:11Z

statsmodels/multivariate/factor_rotation/_gpa_rotation.py

+
+
+
+class unittests(unittest.TestCase):


same here, separate unit tests from main modules

yl565 · 2016-11-27T16:16:07Z

@josef-pkt and @mvds314, I wonder how we can collaborate on this. Maybe @mvds314 can submit a separate pull request of the rotation package since it can be used by PCA as well.

josef-pkt · 2016-11-27T16:30:26Z

I only looked roughly at the structure in this PR.

From my perspective the inclusion of the rotation looks ok. There could be minor git surgery, but nothing serious. We are not squashing all commits before merging (like some other packages), so the inclusion of the rotation will be visible as separate commits.
Usage of rotation for PCA and other models when they are available can be added in follow-up PRs.

One other part, scree_plot and similar look like overlapping/code duplication from PCA, and we might move the common parts into a reusable form instead of keeping duplicated code.

yl565 · 2016-11-27T16:52:39Z

@mvds314 Here is the R code I used to produce the results.

Varimax

R loadings output

  factor 1    factor 2

Basal 0.9695076 0.2293770
Occ 0.9526215 0.2554104
Max 0.7864679 0.5889237
id 0.1462892 0.5742083

python loadings output

Basal 0.9883 -0.1259
Occ 0.9742 -0.1535
Max 0.8442 -0.5027
id 0.2060 -0.5556

equamax

R loadings

Basal 0.9893560 -0.11719005
Occ 0.9824271 -0.08694325
Max 0.9407897 0.28333295
id 0.3344228 0.48915954

Python loadings

Basal 0.9918 0.0946
Occ 0.9842 0.0646
Max 0.9341 -0.3047
id 0.3232 -0.4966

promax

R loadings

Basal 0.9725400 0.04112070
Occ 0.9414561 0.07602842
Max 0.5991637 0.51127594
id -0.1074315 0.64634944

Python loadings

Basal 0.9883 -0.1259
Occ 0.9742 -0.1535
Max 0.8442 -0.5027
id 0.2060 -0.5556

biquartimin

R loadings

Basal 0.9725400 0.04112070
Occ 0.9414561 0.07602842
Max 0.5991637 0.51127594
id -0.1074315 0.64634944

Python loadings

Basal 1.0862 0.1144
Occ 1.0371 0.0639
Max 0.4758 -0.5578
id -0.3790 -0.8541

yl565 · 2016-11-27T18:16:56Z

I moved the factor_rotation unit test to tests folder and fixed pep8 problems.

mvds314 · 2016-11-27T21:28:38Z

@yl565 I am not sure I completely understand the comparison you made between R and Python. I would leave the factor computation itself out of it and just focus on the rotation of the factors. Factor construction itself depends on a lot of details (whether or not to normalize and so on, see for example the R function fa), which is not the aim here.

What we should compare is this R package:
https://cran.r-project.org/web/packages/GPArotation/GPArotation.pdf
with the Python package I put on github. For example, choose your favorite matrix Y and rotate it with varimax. The R code then becomes:
GPArotation::Varimax(Y,eps=1e-5,maxit=100000)
And the Python code becomes
L1,T= rotate_factors(df.values,'varimax', max_tries=100000,tol=1e-5)
I only did a very brief check, but I don't see any differences so far up to several decimals in the examples I looked at.

How does this fit with what you found so far?

yl565 · 2016-11-27T22:15:21Z

After looking at the fa.R source code, it looks like rotate='varimax' uses stats::varimax while rotate='Varimax' calls GPArotation::Varimax. I can confirm that the following code produces the same results as Python:

Basal = c(2.068,	2.068,	2.09,	2.097,	2.117,	2.14,	2.045,	2.076,	2.09,	2.111,	2.093,	2.1,	2.104)
Occ = c(2.07,	2.074,	2.09,	2.093,	2.125,	2.146,	2.054,	2.088,	2.093,	2.114,	2.098,	2.106,	2.101)
Max = c(1.58,	1.602,	1.613,	1.613,	1.663,	1.681,	1.58,	1.602,	1.643,	1.643,	1.653,	1.623,	1.653)
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
Y <- cbind(Basal, Occ, Max, id)
a <- fa(Y, nfactors=2, fm="pa", rotate="Varimax", SMC=TRUE, min.err=1e-10)

Looks like many of the rotations in R fa do not use the GPArotation library. I think we are fine here, since this is exploratory analysis and different ways of rotating the loadings may produce different results.
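To make the last point concrete, here is a minimal sketch of an orthogonal varimax rotation. It uses the classic SVD-based iteration, not the GPA algorithm from the factor_rotation package, and the function name `varimax_sketch` is made up for this illustration. Whatever algorithm is used, an orthogonal rotation changes the individual loadings but leaves each variable's communality (row sum of squared loadings) unchanged, so that is a useful invariant to check when two implementations disagree.

```python
import numpy as np


def varimax_sketch(A, tol=1e-8, max_iter=500):
    """Rotate a loading matrix A (variables x factors) with raw varimax.

    Classic SVD-based iteration; hypothetical helper, not the PR's code.
    Returns the rotated loadings and the orthogonal rotation matrix T.
    """
    A = np.asarray(A, dtype=float)
    p, k = A.shape
    T = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = A @ T
        # Gradient of the varimax criterion with respect to T
        G = A.T @ (L ** 3 - L * (L ** 2).sum(axis=0) / p)
        u, s, vt = np.linalg.svd(G)
        T = u @ vt  # closest orthogonal matrix to the gradient
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return A @ T, T


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2))
L, T = varimax_sketch(A)

# Communalities are invariant under any orthogonal rotation.
print(np.allclose((L ** 2).sum(axis=1), (A ** 2).sum(axis=1)))
```

If two packages return different rotated loadings but identical communalities, the difference comes from the rotation step (criterion, normalization, or convergence settings) rather than from the factor extraction.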

yl565 · 2016-11-27T22:18:28Z

@mvds314 Looks like the factor_rotation library is failing the unit tests under Python 2.7. Can you please take a look?

mvds314 · 2016-11-28T09:17:18Z

@yl565 I took a look at the failed tests. I must say I don't know much about version control, automatic testing and stuff.

The unittests of the github code are running fine on my computer (both on Python 2.7 and 3.5).

I cannot find any tests failing here due to the factor rotation code:
https://travis-ci.org/statsmodels/statsmodels/builds/179298396
Could you be more specific on what test is failing?

josef-pkt

partial review (because I was in the multivariate neighborhood)

comments are under the assumption that we want to make this similar to Model/Results pattern for FactorAnalysis and less like the decomposition pattern in PCA and CCA.

josef-pkt · 2017-11-12T14:17:13Z

statsmodels/multivariate/factor.py

+        self.rotation = rotation
+        return FactorResults(self)
+
+    def plot_scree(self, ncomp=None):


this should be a results method

josef-pkt · 2017-11-12T14:18:47Z

statsmodels/multivariate/factor.py

+                            'equamax', 'oblimin', 'parsimax', 'parsimony',
+                            'biquartimin', 'promax']:
+            raise ValueError('Unknown rotation method %s' % (rotation))
+        R = pd.DataFrame(self.endog).corr().values


use numpy.corrcoef(self.endog, rowvar=0)

as part of data, this could be an attribute.
It doesn't change in repeated calls to fit
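A small sketch of the suggested call, on made-up data. With `rowvar=False` (equivalent to the `rowvar=0` above) each column is treated as a variable, which matches the layout of endog. The array below is random and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
endog = rng.normal(size=(50, 3))   # 50 observations, 3 variables (illustrative data)
endog[:, 1] += 0.8 * endog[:, 0]   # make two columns correlated

# Columns are variables, rows are observations.
R = np.corrcoef(endog, rowvar=False)
print(R.shape)
```

Since the result depends only on the data, it can be computed once and stored, for example as an attribute set when the model is created.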

josef-pkt · 2017-11-12T14:21:50Z

statsmodels/multivariate/factor.py

+            # communality
+            for j in range(len(R)):
+                R[j, j] = c[j]
+            L, V = eig(R)


use eigh ?
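For context, a short sketch of why eigh fits here: the (reduced) correlation matrix is symmetric, so `numpy.linalg.eigh` applies, returns real eigenvalues in ascending order, and avoids the possibly complex, unordered output of the general `eig`. The matrix below is made up for illustration.

```python
import numpy as np

# Illustrative symmetric correlation matrix.
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

evals, evecs = np.linalg.eigh(R)   # real eigenvalues, ascending order

# Factor extraction wants the largest eigenvalues first.
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
print(evals)
```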

josef-pkt · 2017-11-12T14:30:12Z

statsmodels/multivariate/factor.py

+        self.communality = c
+        # Perform rotation of the loadings
+        self.loadings_no_rot = np.array(A)
+        if rotation in ['varimax', 'quartimax', 'biquartimax', 'equamax',


possible structural change: given that we can use different rotation with same raw FA
use pattern as for cov_type

make rotation a fit keyword

set the rotation in the results class __init__

try to avoid any rotation specific attributes in Factor and attach those to results instance

to get a new results instance we could make the optimization loop conditional on not having loadings_not_rot as Factor attribute.

maybe rename loadings_not_rot -> loadings_raw (or just loadings if it's unambiguous in the Factor class. (while FactorResults has the rotated loading).
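A rough sketch of the structure proposed above, under simplifying assumptions: the class names `FactorSketch` and `FactorResultsSketch` are hypothetical, the extraction is a plain eigen decomposition standing in for the real optimization loop, and `rotation` is taken to be a callable returning `(rotated_loadings, T)` rather than a method-name string as in the PR.

```python
import numpy as np


class FactorSketch:
    """Hypothetical model: owns the data and the raw (unrotated) loadings."""

    def __init__(self, corr, n_factor):
        self.corr = np.asarray(corr, dtype=float)
        self.n_factor = n_factor
        self.loadings_raw = None   # set by the first call to fit

    def _extract(self):
        # Stand-in for the real extraction loop.
        evals, evecs = np.linalg.eigh(self.corr)
        order = np.argsort(evals)[::-1][:self.n_factor]
        return evecs[:, order] * np.sqrt(np.clip(evals[order], 0, None))

    def fit(self, rotation=None):
        # Run the extraction only if no raw solution is cached yet.
        if self.loadings_raw is None:
            self.loadings_raw = self._extract()
        return FactorResultsSketch(self, rotation)


class FactorResultsSketch:
    """Hypothetical results: everything rotation-specific lives here."""

    def __init__(self, model, rotation=None):
        self.model = model
        self.rotation = rotation
        if rotation is None:
            self.loadings = model.loadings_raw.copy()
            self.rotation_matrix = np.eye(model.n_factor)
        else:
            self.loadings, self.rotation_matrix = rotation(model.loadings_raw)
```

With this layout, calling `fit` again with a different rotation returns a new results instance without repeating the extraction, similar to how `cov_type` is handled in other models.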

josef-pkt · 2017-11-12T14:31:16Z

statsmodels/multivariate/factor.py

+        """
+        return plot_scree(self.eigenvals, self.n_comp, ncomp)
+
+    def plot_loadings(self, loading_pairs=None, plot_prerotated=False):


this should also be a results method because it requires that fit has been called

josef-pkt · 2017-11-12T14:43:47Z

statsmodels/multivariate/plots.py

@@ -0,0 +1,141 @@
+import matplotlib.pyplot as plt


matplotlib import needs to be try ... except protected because it is not a required dependency

still todo, causes import error on Travis with python 2 without matplotlib
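One way to protect the import, as a sketch only. The names `HAS_MATPLOTLIB` and `plot_scree_sketch` are invented here, and this is not necessarily how the rest of statsmodels guards optional plotting dependencies.

```python
try:
    import matplotlib.pyplot as plt
    HAS_MATPLOTLIB = True
except ImportError:
    # matplotlib is optional, so importing this module must still succeed.
    plt = None
    HAS_MATPLOTLIB = False


def plot_scree_sketch(eigenvals):
    """Plot eigenvalues against component number (hypothetical helper)."""
    if not HAS_MATPLOTLIB:
        raise ImportError("matplotlib is required for plotting")
    fig, ax = plt.subplots()
    ax.plot(range(1, len(eigenvals) + 1), eigenvals, "o-")
    ax.set_xlabel("Factor")
    ax.set_ylabel("Eigenvalue")
    return fig
```

The error is then raised only when a plot is actually requested, not when the module is imported.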

josef-pkt · 2017-12-10T22:41:52Z

The SAS link mentions SMC under priors, R sounds like it's just starting parameters.

question: does SMC have an effect on the final estimate or is it just to get the iterations started?

I had thought it applies FA to a differently scaled correlation matrix, but if it is only for starting values, then it might still belong into fit and not into __init__.

yl565 · 2017-12-10T22:54:48Z

It specifies the initial guess for communality and it has an effect on the final results.
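For reference, a minimal sketch of the squared multiple correlations, assuming the usual definition: the SMC of variable i is 1 - 1 / (R^-1)_ii, i.e. the R-squared from regressing variable i on all the others. The helper name is made up for this example.

```python
import numpy as np


def smc_sketch(corr):
    """Squared multiple correlation of each variable with all the others.

    Hypothetical helper; assumes corr is a nonsingular correlation matrix.
    """
    corr = np.asarray(corr, dtype=float)
    return 1.0 - 1.0 / np.diag(np.linalg.inv(corr))


# With two variables the SMC is just the squared correlation.
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])
print(smc_sketch(R))
```

These values replace the ones on the diagonal of the correlation matrix as the starting communalities.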

josef-pkt

I made a few comments to the latest changes.

(I think we are getting close)

josef-pkt · 2017-12-11T00:32:46Z

statsmodels/multivariate/factor.py

+
+    """
+    def __init__(self, endog, n_factor, corr=None, method='pa', smc=True,
+                 missing='drop', endog_names=None):


add also nobs as Kerby pointed out.
It's not currently used but will be necessary in future when we get some inferential methods.

josef-pkt · 2017-12-11T00:35:12Z

statsmodels/multivariate/factor.py

+                    raise ValueError('The number of columns in endog must be '
+                                     'equal to the number of columns and rows corr')
+        self.corr = corr
+        self.endog_names = endog_names


corr could be a pandas DataFrame
We need self.corr = np.asarray(corr) to make sure that we have an array for further processing.
endog_names could be taken from the index or column of the DataFrame,
e.g. if hasattr(corr, 'index'): endog_names=...
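A sketch of that handling, assuming duck typing on a `columns` attribute so that it works for a DataFrame without importing pandas. The function name `split_corr_and_names` is hypothetical.

```python
import numpy as np


def split_corr_and_names(corr, endog_names=None):
    """Return corr as a plain ndarray plus variable names, if available.

    Hypothetical helper: names given explicitly win, otherwise they are
    taken from a `columns` attribute (as on a pandas DataFrame).
    """
    if endog_names is None and hasattr(corr, "columns"):
        endog_names = list(corr.columns)
    corr = np.asarray(corr, dtype=float)
    if corr.ndim != 2 or corr.shape[0] != corr.shape[1]:
        raise ValueError("corr must be a square 2-D array")
    return corr, endog_names
```

Names have to be read before the conversion, because `np.asarray` drops the labels.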

josef-pkt · 2017-12-11T00:36:14Z

statsmodels/multivariate/factor.py

+            if self.endog is not None:
+                return self.data.ynames
+            else:
+                return None


as default if not available you could make up some names
(that's what we usually do by default in the models)

josef-pkt · 2017-12-11T00:37:55Z

statsmodels/multivariate/factor.py

+                raise ValueError('The number of elements in endog_names must '
+                                 'be equal to the number of columns and rows in corr')
+            if self.endog is not None and len(value) != self.endog.shape[1]:
+                raise ValueError('The number of elements in endog_names must '


You could define k_endog in __init__ to store corr.shape[0] or endog.shape[1]
Then the check would be simpler here
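A sketch combining this with the earlier suggestion about default names: once k_endog is stored, both the length check and the made-up names reduce to a few lines. The helper name and the `var0`, `var1`, ... naming scheme are assumptions for illustration, not the names the PR ended up using.

```python
def resolve_endog_names(endog_names, k_endog):
    """Validate user-supplied names or make up defaults (hypothetical helper)."""
    if endog_names is None:
        # Default names when none are available.
        return ["var%d" % i for i in range(k_endog)]
    endog_names = list(endog_names)
    if len(endog_names) != k_endog:
        raise ValueError("The number of elements in endog_names must be "
                         "equal to k_endog (%d)" % k_endog)
    return endog_names


print(resolve_endog_names(None, 3))
```

Here k_endog would be `corr.shape[0]` when a correlation matrix is given and `endog.shape[1]` otherwise, so a single check covers both cases.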

josef-pkt · 2017-12-11T00:38:34Z

statsmodels/multivariate/factor.py

+
+    def fit(self, n_max_iter=50, tolerance=1e-6):
+        """
+        Extract factors


fit is the public function and needs the full docstring

maxiter and tol need renaming

coveralls · 2017-12-11T00:59:01Z

Coverage increased (+0.04%) to 81.956% when pulling 626a6b0 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

yl565 · 2017-12-11T01:46:18Z

Changes applied

coveralls · 2017-12-11T02:34:58Z

Coverage increased (+0.04%) to 81.955% when pulling 4fa1309 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

josef-pkt · 2017-12-11T02:57:04Z

statsmodels/multivariate/factor.py


    """
    def __init__(self, endog, n_factor, corr=None, method='pa', smc=True,
-                 missing='drop', endog_names=None):
-        # CHeck validity of n_factor
+                 missing='drop', endog_names=None, n_obs=None):


naming is nobs, which we use in all model

coveralls · 2017-12-11T03:00:11Z

Coverage increased (+0.04%) to 81.955% when pulling 4fa1309 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

josef-pkt · 2017-12-11T03:01:58Z

comment to a commit (which might not be very visible):
naming is nobs, which we use in all model
i.e. n_obs -> nobs

coveralls · 2017-12-11T04:15:17Z

Coverage increased (+0.04%) to 81.958% when pulling 47ef9d2 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

josef-pkt · 2017-12-12T01:51:20Z

@yl565 Can you rebase this on master and push to a NEW PR?
Using a new PR avoids that we mess up the history of this PR, I don't think we have to squash commits.
(If you have problems, then I can do the rebase. )

I think we need to add "status: experimental" to the docstrings because I'm pretty sure we have to change things as we add more features.
(I'm still not familiar enough with FA to tell what we will need. The latest thing I tried is to get the factors and the prediction of endog based on the factors, i.e. observation-specific information. However, it's not as easy as in PCA. Additionally, we need the reconstituted cov_endog from the factor model.)

yl565 · 2017-12-12T01:56:31Z

OK, I will create a new branch, rebase on the new branch and push it to a new PR. Is it right?

josef-pkt · 2017-12-12T02:03:38Z

more precise: rebase the new branch on master
Otherwise, yes that's it.

I think I will merge after another spot checking.
But there is still work to do after the merge, e.g. AFAICS, smc=False still doesn't have unit tests.

yl565 · 2017-12-12T02:06:20Z

statsmodels/multivariate/tests/test_factor.py

+def test_example_compare_to_R_output():
+    # No rotation without squared multiple correlations prior
+    # produce same results as in R `fa`
+    mod = Factor(X.iloc[:, 1:-1], 2, smc=False)


Actually, the unit test is here

I guess I need better reading glasses.
Thanks

No problem, just let me know if there is anything else need to be done

josef-pkt · 2017-12-12T02:42:10Z

If you have time to work on this some more, then there are a few features to get closer to what e.g. Stata has, e.g. methods for factor scoring.

I will look at the new PR and most likely will merge it tomorrow.

What we still need is a paragraph, or so, with a brief description for each of the new multivariate features that you added for inclusion in the release notes.

I will check inclusion in api.py and in rst documentation

coveralls · 2017-12-12T02:56:43Z

Coverage increased (+0.04%) to 81.958% when pulling 653a43f on yl565:factoranalysis into 86e3d4b on statsmodels:master.

josef-pkt · 2017-12-14T15:01:28Z

rebased version #4161 merged

josef-pkt · 2017-12-14T15:12:18Z

I opened #4164 to keep track of enhancement and refactoring ideas.

yl565 changed the title ~~Add factor analysis~~ [WIP] Add factor analysis Nov 26, 2016

josef-pkt added comp-multivariate type-enh labels Nov 26, 2016

yl565 changed the title ~~[WIP] Add factor analysis~~ [MRG] Add factor analysis Nov 27, 2016

josef-pkt reviewed Nov 27, 2016

josef-pkt added this to needs_review in try_josef_queue Feb 26, 2017

josef-pkt reviewed Nov 12, 2017

yl565 added 2 commits December 10, 2017 19:16

Add support to use correlation matrix directly

e5b4f00

Add tests

626a6b0

josef-pkt reviewed Dec 11, 2017

Add n_obs, add docstrings, add variable auto naming

eedb869

Improve docstrings

4fa1309

josef-pkt reviewed Dec 11, 2017

change n_obs -> nobs, add a test for nobs

47ef9d2

yl565 commented Dec 12, 2017

Add "Status: experimental"

653a43f

yl565 mentioned this pull request Dec 12, 2017

Principal axis factor analysis and rotation #4161

Merged

josef-pkt closed this Dec 14, 2017

josef-pkt mentioned this pull request Dec 14, 2017

ENH: factor analysis follow-up #4164

Open

josef-pkt moved this from needs_review to done in try_josef_queue Apr 16, 2018
