[MRG] Add factor analysis #3294
Here is the R code to produce the results with library(psych):
library(psych)
Basal = c(2.068, 2.068, 2.09, 2.097, 2.117, 2.14, 2.045, 2.076, 2.09, 2.111, 2.093, 2.1, 2.104)
Occ = c(2.07, 2.074, 2.09, 2.093, 2.125, 2.146, 2.054, 2.088, 2.093, 2.114, 2.098, 2.106, 2.101)
Max = c(1.58, 1.602, 1.613, 1.613, 1.663, 1.681, 1.58, 1.602, 1.643, 1.643, 1.653, 1.623, 1.653)
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
Y <- cbind(Basal, Occ, Max, id)
a <- fa(Y, nfactors=2, fm="pa", rotate="none", SMC=TRUE, min.err=1e-10)
a$loadings
a$communality
(I haven't looked yet.) I remember some discussions on mailing lists (not on the statsmodels mailing list) about factor rotation.
@josef-pkt Nice! Glad to know I don't have to do it from scratch. How do I use it? I mean, can I copy and paste the code and acknowledge the source in the file?
If it's license compatible (especially BSD-2, BSD-3 or MIT), then we can just copy the code. For good manners and traceability of the origin:
Factor rotation added using the factor_rotation package by @mvds314. The following rotation method produces the same results as R fa: The following method produces somewhat different results:
@josef-pkt Can I translate some R code into Python?
Almost all R code is GPL licensed, which is NOT compatible with our license, BSD-3, so we cannot translate that code.
In the past (after my transition from MATLAB to Python), I sometimes looked at the MATLAB File Exchange for some details; most of it is BSD licensed, but not all.
@josef-pkt I think this is ready for you to review. After it is merged, I'm planning to implement other factor analysis methods, such as maximum likelihood.
Good to see that the code on GitHub is of benefit!
Both the R package and the Python package use the algorithms described here: The Python package is (with explicit permission of the original authors) published under a BSD license on GitHub. Just to mention, the package on GitHub might not work under Python 3 due to the way I imported things in the __init__.py file. (This was the first time I put code in an actual package :)
#return
return A.dot(T), T

class unittests(unittest.TestCase):
unit tests should go into a statsmodels/multivariate/factor_rotation/tests
directory so they are picked up by nose for automatic testing
class unittests(unittest.TestCase):
same here, separate unit tests from main modules
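A minimal sketch of what a separate test module could look like, assuming a hypothetical file name such as statsmodels/multivariate/factor_rotation/tests/test_rotation.py. The rotate helper below is only a stand-in that mirrors the `return A.dot(T), T` convention visible in the diff; it is not the package's actual function, and the loadings are made-up numbers.

```python
import numpy as np


def rotate(A, T):
    """Stand-in for a rotation routine: returns rotated loadings and T."""
    return A.dot(T), T


def test_orthogonal_rotation_preserves_communalities():
    # Unrotated loadings: 3 variables, 2 factors (made-up numbers).
    A = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]])
    theta = np.pi / 6
    # A 2x2 orthogonal rotation matrix.
    T = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])
    L, T_out = rotate(A, T)
    # Row sums of squared loadings (communalities) are invariant
    # under an orthogonal rotation.
    np.testing.assert_allclose((L ** 2).sum(axis=1), (A ** 2).sum(axis=1))
    np.testing.assert_allclose(T_out.T.dot(T_out), np.eye(2), atol=1e-12)
```

Keeping functions named test_* in a tests directory, away from the main module, lets a test runner collect them without importing unittest classes from production code.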
@josef-pkt and @mvds314, I wonder how we can collaborate on this. Maybe @mvds314 can submit a separate pull request of the rotation package since it can be used by PCA as well.
I only looked roughly at the structure in this PR. From my perspective the inclusion of the rotation looks ok. There could be minor git surgery, but nothing serious. We are not squashing all commits before merging (like some other packages), so the inclusion of the rotation will be visible as separate commits. One other part, scree_plot and similar look like overlapping/code duplication from PCA, and we might move the common parts into a reusable form instead of keeping duplicated code.
@mvds314 Here is the R code I used to produce the results.

Varimax
R loadings output: Basal 0.9695076 0.2293770
Python loadings output: Basal 0.9883 -0.1259

equamax
R loadings: Basal 0.9893560 -0.11719005
Python loadings: Basal 0.9918 0.0946

promax
R loadings: Basal 0.9725400 0.04112070
Python loadings: Basal 0.9883 -0.1259

biquartimin
R loadings: Basal 0.9725400 0.04112070
Python loadings: Basal 1.0862 0.1144
I moved the |
@yl565 I am not sure I completely understand the comparison you made between R and Python. I would leave the factor computation itself out of it and just focus on the rotation of the factors. Factor construction itself depends on a lot of details (whether or not to normalize and so on, see for example the R function fa), which is not the aim here. What we should compare is this R package: How does this fit with what you found so far?
After looking at the fa.R source code, it looks like rotate='varimax' uses stats::varimax, while rotate='Varimax' calls GPArotation::Varimax. I can confirm that the following code produces the same results as Python:
Basal = c(2.068, 2.068, 2.09, 2.097, 2.117, 2.14, 2.045, 2.076, 2.09, 2.111, 2.093, 2.1, 2.104)
Occ = c(2.07, 2.074, 2.09, 2.093, 2.125, 2.146, 2.054, 2.088, 2.093, 2.114, 2.098, 2.106, 2.101)
Max = c(1.58, 1.602, 1.613, 1.613, 1.663, 1.681, 1.58, 1.602, 1.643, 1.643, 1.653, 1.623, 1.653)
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
Y <- cbind(Basal, Occ, Max, id)
a <- fa(Y, nfactors=2, fm="pa", rotate="Varimax", SMC=TRUE, min.err=1e-10)
Looks like a lot of rotation in R
@mvds314 Looks like the factor_rotation library is failing the unit tests under Python 2.7. Can you please take a look?
@yl565 I took a look at the failed tests. I must say I don't know much about version control, automatic testing and so on. The unit tests of the GitHub code run fine on my computer (both on Python 2.7 and 3.5). I cannot find any tests failing here due to the factor rotation code:
partial review (because I was in the multivariate neighborhood)
comments are under the assumption that we want to make this similar to Model/Results pattern for FactorAnalysis and less like the decomposition pattern in PCA and CCA.
statsmodels/multivariate/factor.py
Outdated
self.rotation = rotation
return FactorResults(self)

def plot_scree(self, ncomp=None):
this should be a results method
statsmodels/multivariate/factor.py
Outdated
'equamax', 'oblimin', 'parsimax', 'parsimony',
'biquartimin', 'promax']:
raise ValueError('Unknown rotation method %s' % (rotation))
R = pd.DataFrame(self.endog).corr().values
use numpy.corrcoef(self.endog, rowvar=0)
as part of data, this could be an attribute.
It doesn't change in repeated calls to fit
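As a rough illustration of the two suggestions above, the correlation matrix can be computed once with NumPy and kept around. The data here is randomly generated, and the manual formula is included only as a cross-check.

```python
import numpy as np

rng = np.random.RandomState(0)
endog = rng.normal(size=(13, 4))  # made-up data: 13 observations, 4 variables

# rowvar=False treats columns as variables (rows are observations).
corr = np.corrcoef(endog, rowvar=False)

# Manual cross-check: standardize the columns, then average cross products.
z = (endog - endog.mean(axis=0)) / endog.std(axis=0)
corr_manual = z.T.dot(z) / endog.shape[0]
```

One difference to keep in mind: `np.corrcoef` does not skip missing values the way `DataFrame.corr()` does, so this assumes missing data has already been handled.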
statsmodels/multivariate/factor.py
Outdated
# communality
for j in range(len(R)):
R[j, j] = c[j]
L, V = eig(R)
use eigh?
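A small sketch of why eigh is a natural fit here: the correlation matrix stays symmetric after the diagonal is replaced by communalities, and `np.linalg.eigh` returns real eigenvalues in ascending order for symmetric input. The matrix below is a made-up example.

```python
import numpy as np

R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

# General solver: no ordering guarantee on the eigenvalues.
vals_eig, vecs_eig = np.linalg.eig(R)
# Symmetric solver: real eigenvalues, sorted in ascending order.
vals_eigh, vecs_eigh = np.linalg.eigh(R)

# Factor extraction usually wants the largest eigenvalues first.
order = np.argsort(vals_eigh)[::-1]
vals_desc = vals_eigh[order]
vecs_desc = vecs_eigh[:, order]
```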
statsmodels/multivariate/factor.py
Outdated
self.communality = c
# Perform rotation of the loadings
self.loadings_no_rot = np.array(A)
if rotation in ['varimax', 'quartimax', 'biquartimax', 'equamax',
possible structural change: given that we can use different rotations with the same raw FA, use the same pattern as for cov_type
- make rotation a fit keyword
- set the rotation in the results class __init__
- try to avoid any rotation-specific attributes in Factor and attach those to the results instance
To get a new results instance we could make the optimization loop conditional on not having loadings_no_rot as a Factor attribute.
Maybe rename loadings_no_rot -> loadings_raw (or just loadings if it's unambiguous in the Factor class, while FactorResults has the rotated loadings).
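The structure proposed above could be sketched roughly as follows. Everything here is hypothetical: ToyFactor, ToyFactorResults and _varimax are illustrative names, the extraction step is a plain eigendecomposition rather than the PR's principal axis iteration, and the varimax routine is a textbook SVD-based version, not the factor_rotation package.

```python
import numpy as np


def _varimax(A, max_iter=100, tol=1e-8):
    """Textbook varimax via SVD iterations; returns (A.dot(T), T)."""
    p, k = A.shape
    T = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = A.dot(T)
        grad = A.T.dot(L ** 3 - L.dot(np.diag((L ** 2).sum(axis=0))) / p)
        u, s, vt = np.linalg.svd(grad)
        T = u.dot(vt)
        crit_new = s.sum()
        if crit_new < crit * (1 + tol):
            break
        crit = crit_new
    return A.dot(T), T


class ToyFactor(object):
    def __init__(self, corr, n_factor):
        self.corr = np.asarray(corr, dtype=float)
        self.n_factor = n_factor
        self.loadings_raw = None  # cached, rotation independent

    def _extract(self):
        # Stand-in extraction: leading eigenvectors scaled by sqrt(eigenvalue).
        vals, vecs = np.linalg.eigh(self.corr)
        idx = np.argsort(vals)[::-1][:self.n_factor]
        return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

    def fit(self, rotation=None):
        # Only run the (potentially expensive) extraction once.
        if self.loadings_raw is None:
            self.loadings_raw = self._extract()
        return ToyFactorResults(self, rotation)


class ToyFactorResults(object):
    def __init__(self, model, rotation=None):
        self.model = model
        self.rotation = rotation
        if rotation is None:
            self.loadings = model.loadings_raw.copy()
            self.rotation_matrix = np.eye(model.n_factor)
        elif rotation == 'varimax':
            self.loadings, self.rotation_matrix = _varimax(model.loadings_raw)
        else:
            raise ValueError('Unknown rotation method %s' % rotation)
```

With this layout, calling `fit` again with a different rotation reuses the cached raw loadings and returns a fresh results instance, which is the same idea as choosing cov_type at fit time.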
statsmodels/multivariate/factor.py
Outdated
"""
return plot_scree(self.eigenvals, self.n_comp, ncomp)

def plot_loadings(self, loading_pairs=None, plot_prerotated=False):
this should also be a results method because it requires that fit has been called
@@ -0,0 +1,141 @@
import matplotlib.pyplot as plt
matplotlib import needs to be try ... except protected because it is not a required dependency
still todo, causes import error on Travis with python 2 without matplotlib
The SAS link mentions SMC under priors; in R it sounds like it's just starting parameters. Question: does SMC have an effect on the final estimate, or is it just to get the iterations started? I had thought it applies FA to a differently scaled correlation matrix, but if it is only for starting values, then it might still belong into
It specifies the initial guess for the communality, and it has an effect on the final results.
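To make the role of SMC concrete, here is a generic principal axis sketch in which the squared multiple correlations, computed as 1 - 1/diag(inv(R)), serve as the initial communalities. The function names and defaults are assumptions for illustration; this is not claimed to reproduce the PR's code or R's fa.

```python
import numpy as np


def smc(R):
    """Squared multiple correlation of each variable with all the others."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))


def principal_axis(R, n_factor, use_smc=True, maxiter=50, tol=1e-6):
    R = np.array(R, dtype=float)
    # Initial communalities: SMC, or ones (which starts from plain PCA).
    c = smc(R) if use_smc else np.ones(len(R))
    for _ in range(maxiter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, c)  # communalities on the diagonal
        vals, vecs = np.linalg.eigh(R_reduced)
        idx = np.argsort(vals)[::-1][:n_factor]
        A = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))
        c_new = (A ** 2).sum(axis=1)
        converged = np.abs(c_new - c).max() < tol
        c = c_new
        if converged:
            break
    return A, c
```

Since the loop stops after a finite number of iterations or at a tolerance, the starting communalities can influence where it ends up, which is consistent with the observation above.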
I made a few comments to the latest changes.
(I think we are getting close)
statsmodels/multivariate/factor.py
Outdated
"""
def __init__(self, endog, n_factor, corr=None, method='pa', smc=True,
missing='drop', endog_names=None):
add also nobs, as Kerby pointed out.
It's not currently used but will be necessary in future when we get some inferential methods.
statsmodels/multivariate/factor.py
Outdated
raise ValueError('The number of columns in endog must be '
'equal to the number of columns and rows corr')
self.corr = corr
self.endog_names = endog_names
corr could be a pandas DataFrame.
We need self.corr = np.asarray(corr) to make sure that we have an array for further processing.
endog_names could be taken from the index or columns of the DataFrame,
e.g. if hasattr(corr, 'index'): endog_names=...
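A hedged sketch of the handling described above. The helper name prepare_corr and the 'var%d' default names are made up for illustration; the check uses the columns attribute, but the index would work equally well for a square correlation DataFrame.

```python
import numpy as np


def prepare_corr(corr, endog_names=None):
    """Hypothetical helper: return (ndarray, names) from an array-like corr."""
    if endog_names is None and hasattr(corr, 'columns'):
        # pandas DataFrame: reuse its column labels as variable names.
        endog_names = list(corr.columns)
    corr = np.asarray(corr, dtype=float)
    if endog_names is None:
        # Made-up default names when nothing better is available.
        endog_names = ['var%d' % i for i in range(corr.shape[0])]
    return corr, endog_names
```

Passing a DataFrame would then yield a plain array plus the column labels, while passing a bare ndarray falls back to the generated names.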
statsmodels/multivariate/factor.py
Outdated
if self.endog is not None:
return self.data.ynames
else:
return None
as default if not available you could make up some names
(that's what we usually do by default in the models)
raise ValueError('The number of elements in endog_names must '
'be equal to the number of columns and rows in corr')
if self.endog is not None and len(value) != self.endog.shape[1]:
raise ValueError('The number of elements in endog_names must '
You could define k_endog in __init__ to store corr.shape[0] or endog.shape[1].
Then the check would be simpler here.
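A possible shape for that simplification, as a skeleton only. FactorSketch is a hypothetical class that shows just the k_endog bookkeeping and the resulting single length check; it omits everything else the real model does.

```python
import numpy as np


class FactorSketch(object):
    """Hypothetical skeleton showing only the k_endog bookkeeping."""

    def __init__(self, endog=None, corr=None, endog_names=None):
        if endog is None and corr is None:
            raise ValueError('Either endog or corr must be provided.')
        self.endog = None if endog is None else np.asarray(endog)
        self.corr = None if corr is None else np.asarray(corr)
        # Number of variables, taken from whichever input is available.
        if self.corr is not None:
            self.k_endog = self.corr.shape[0]
        else:
            self.k_endog = self.endog.shape[1]
        self._endog_names = None
        if endog_names is not None:
            self.endog_names = endog_names

    @property
    def endog_names(self):
        return self._endog_names

    @endog_names.setter
    def endog_names(self, value):
        # One check covers both the endog and the corr case.
        if len(value) != self.k_endog:
            raise ValueError('The number of elements in endog_names must '
                             'be equal to k_endog (%d).' % self.k_endog)
        self._endog_names = list(value)
```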
statsmodels/multivariate/factor.py
Outdated
def fit(self, n_max_iter=50, tolerance=1e-6):
"""
Extract factors
fit is the public function and needs the full docstring.
n_max_iter and tolerance need renaming to maxiter and tol.
Changes applied
statsmodels/multivariate/factor.py
Outdated
"""
def __init__(self, endog, n_factor, corr=None, method='pa', smc=True,
missing='drop', endog_names=None):
# Check validity of n_factor
missing='drop', endog_names=None, n_obs=None):
naming is nobs, which we use in all models
comment to a commit (which might not be very visible):
@yl565 Can you rebase this on master and push to a NEW PR? I think we need to add "status: experimental" to the docstrings because I'm pretty sure we will have to change things as we add more features.
OK, I will create a new branch, rebase on the new branch and push it to a new PR. Is that right?
More precisely: rebase the new branch on master. I think I will merge after another spot check.
def test_example_compare_to_R_output():
# No rotation without squared multiple correlations prior
# produce same results as in R `fa`
mod = Factor(X.iloc[:, 1:-1], 2, smc=False)
Actually, the unit test is here
I guess I need better reading glasses.
Thanks
No problem, just let me know if there is anything else that needs to be done.
If you have time to work on this some more, then there are a few features to get closer to what e.g. Stata has, e.g. methods for factor scoring. I will look at the new PR and most likely will merge it tomorrow. What we still need is a paragraph, or so, with a brief description for each of the new I will check inclusion in api.py and in rst documentation
rebased version #4161 merged
I opened #4164 to keep track of enhancement and refactoring ideas.
This PR is for adding factor analysis. Principal axes factor analysis is already implemented, and the results are identical to the R psych library fa when rotate="none" and fm="pa".
To-do:
- FactorResults class