Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Add factor analysis #3294

Closed
wants to merge 27 commits into from
Closed

Conversation

yl565
Copy link
Contributor

@yl565 yl565 commented Nov 26, 2016

This PR is for adding factor analysis. Already implemented Principal Axes Factor Analysis and the results are identical to R psych library fa when rotation="none" and fm="pa".

To-do:

  • FactorResults class
  • Varimax, Quartimax and other rotations
  • Scree and variance explained plots
  • factor loading plots
  • Input validation
  • More examples

@yl565 yl565 changed the title Add factor analysis [WIP] Add factor analysis Nov 26, 2016
@yl565
Copy link
Contributor Author

yl565 commented Nov 26, 2016

Here is the R code to produce results of test_example_compare_to_R_output() :

library(psych)
Basal = c(2.068,	2.068,	2.09,	2.097,	2.117,	2.14,	2.045,	2.076,	2.09,	2.111,	2.093,	2.1,	2.104)
Occ = c(2.07,	2.074,	2.09,	2.093,	2.125,	2.146,	2.054,	2.088,	2.093,	2.114,	2.098,	2.106,	2.101)
Max = c(1.58,	1.602,	1.613,	1.613,	1.663,	1.681,	1.58,	1.602,	1.643,	1.643,	1.653,	1.623,	1.653)
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
Y <- cbind(Basal, Occ, Max, id)
a <- fa(Y, nfactors=2, fm="pa", rotate="none", SMC=TRUE, min.err=1e-10)
a$loadings
a$communality

@josef-pkt
Copy link
Member

(I haven't looked yet)

I remember some discussions in mailing lists (not on statsmodels mailing list) about factor rotation.
There might be some reusable code out there that is license compatible, if needed.
eg. a brief search for "python varimax" shows
https://github.com/mvds314/factor_rotation
which I've never seen before. It's license compatible, looks like BSD-3, same as ours.

@yl565
Copy link
Contributor Author

yl565 commented Nov 26, 2016

@josef-pkt Nice! Glad to know I don't have to do it from scratch. How do I use it? I mean can I copy and paste and acknowledge in file

@josef-pkt
Copy link
Member

If it's license compatible (especially BSD-2, BSD-3 or MIT), then we can just copy the code.

For good manners and traceability of the origin:
If it's just part of a module like one or a few functions, then I just copy them. If it's almost another package, e.g. several modules, then I try to have a clean commit of the unchanged original code where we can specify the author in the commit message. And make adjustments to the code in follow-up commits.
For one large commit, where I essentially copied another package into statsmodels, I also contacted the author to tell him that I'm doing it.

@yl565
Copy link
Contributor Author

yl565 commented Nov 27, 2016

Factor rotation added using the factor_rotation package by @mvds314

The following rotation method produces the same results as R fa:
None (as 'none' in R), "quartimax", "oblimin"

The following method produces somewhat different results:
"varimax", "equamax", "promax", "biquartimin"

"varimax" produces the same results as the code here

@yl565
Copy link
Contributor Author

yl565 commented Nov 27, 2016

@josef-pkt Can I translate some R code into python?

@josef-pkt
Copy link
Member

Almost all R code is GPL licensed which is NOT compatible with our license, BSD-3, and we cannot translate that code.
We can only translate R code, either if it has the BSD or MIT type license, there are a very few packages with it, or we have the explicit permission by the author.

@josef-pkt
Copy link
Member

In the past (after my transition from matlab to python), I was looking sometimes on the matlab fileexchange for some details, most of it is BSD licensed but not all.
e.g.
https://www.mathworks.com/matlabcentral/fileexchange/?term=tag%3A%22factor+analysis%22
Antonio Trujillo-Ortiz has a nice collection of statistical tests that I looked at (more for inspiration or checking some detail than actual translation)
Julia has a multivariate package but it doesn't have factor models yet. Most Julia packages are MIT licensed and license compatible for us.

@yl565 yl565 changed the title [WIP] Add factor analysis [MRG] Add factor analysis Nov 27, 2016
@yl565
Copy link
Contributor Author

yl565 commented Nov 27, 2016

@josef-pkt I think this is ready for you to review. After it merged, I'm planning to implement other factor analysis method like maximum likelihood

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 87.644% when pulling 8f2895c on yl565:factoranalysis into 5797248 on statsmodels:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 87.644% when pulling 0d127b5 on yl565:factoranalysis into 5797248 on statsmodels:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 87.636% when pulling 0d127b5 on yl565:factoranalysis into 5797248 on statsmodels:master.

@mvds314
Copy link

mvds314 commented Nov 27, 2016

Good to see that this the code on github is of benefit!

The following method produces somewhat different results:
"varimax", "equamax", "promax", "biquartimin"

Both the R package and the Python package use the algorithms described here:
http://www.stat.ucla.edu/research/gpa/
The Python package contains unittests that test whether the code produces the same output as the Matlab/Octave code on the site. In that sense, the code already has been double checked for errors. If there are inconsistencies with the R code you are referring to that need to be resolved, I might be able to help.

The Python package is (with explicit permission of the original authors) published under a BSD license on github.

Just to mention, the code package on github might be not be working under Python 3 due to the way I imported stuff in the init.py file. (This was the first time I put code in an actual package :)

#return
return A.dot(T), T

class unittests(unittest.TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unit tests should go into a statsmodels/multivariate/factor_rotation/tests directory so they are picked up by nose for automatic testing




class unittests(unittest.TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, separate unit tests from main modules

@yl565
Copy link
Contributor Author

yl565 commented Nov 27, 2016

@josef-pkt and @mvds314, I wonder how we can collaborate on this. Maybe @mvds314 can submit a separate pull request of the rotation package since it can be used by PCA as well.

@josef-pkt
Copy link
Member

I only looked roughly at the structure in this PR.

From my perspective the inclusion of the rotation looks ok. There could be minor git surgery, but nothing serious. We are not squashing all commits before merging (like some other packages), so the inclusion of the rotation will be visible as separate commits.
Usage of rotation for PCA and other models when they are available can be added in follow-up PRs.

One other part, scree_plot and similar look like overlapping/code duplication from PCA, and we might move the common parts into a reusable form instead of keeping duplicated code.

@yl565
Copy link
Contributor Author

yl565 commented Nov 27, 2016

@mvds314 Here is the R code I used to produce the results.

Varimax

R loadings output

  factor 1    factor 2

Basal 0.9695076 0.2293770
Occ 0.9526215 0.2554104
Max 0.7864679 0.5889237
id 0.1462892 0.5742083

python loadings output

Basal 0.9883 -0.1259
Occ 0.9742 -0.1535
Max 0.8442 -0.5027
id 0.2060 -0.5556

equamax

R loadings

Basal 0.9893560 -0.11719005
Occ 0.9824271 -0.08694325
Max 0.9407897 0.28333295
id 0.3344228 0.48915954

Python loadings

BBasal 0.9918 0.0946
Occ 0.9842 0.0646
Max 0.9341 -0.3047
id 0.3232 -0.4966

promax

R loadings

Basal 0.9725400 0.04112070
Occ 0.9414561 0.07602842
Max 0.5991637 0.51127594
id -0.1074315 0.64634944

Python loadings

Basal 0.9883 -0.1259
Occ 0.9742 -0.1535
Max 0.8442 -0.5027
id 0.2060 -0.5556

biquartimin

R loadings

Basal 0.9725400 0.04112070
Occ 0.9414561 0.07602842
Max 0.5991637 0.51127594
id -0.1074315 0.64634944

Python loadings

Basal 1.0862 0.1144
Occ 1.0371 0.0639
Max 0.4758 -0.5578
id -0.3790 -0.8541

@yl565
Copy link
Contributor Author

yl565 commented Nov 27, 2016

I moved the factor_rotation unit test to tests folder and fixed pep8 problems.

@mvds314
Copy link

mvds314 commented Nov 27, 2016

@yl565 I am not sure if completely understand what you comparison you made between R and Python. I would leave the factor computation itself out of it and just focus on the rotation of the factors. Factor construction itself depends on a lot of details (whether or not to normalize and stuff like that, see for example the R function fa) which is not the aim here.

What we should compare is this R package:
https://cran.r-project.org/web/packages/GPArotation/GPArotation.pdf
with the Python package I put on github. For example, choose for favorite matrix Y and rotate it with varimax. The R code then becomes:
GPArotation::Varimax(Y,eps=1e-5,maxit=100000)
And the Python code becomes
L1,T= rotate_factors(df.values,'varimax', max_tries=100000,tol=1e-5)
I only did a very brief check, but I don't see any differences so far up to several decimals in the examples I looked at.

How does this fit which what you found so far?

@yl565
Copy link
Contributor Author

yl565 commented Nov 27, 2016

After looking at fa.R source code, it looks like using rotate='varimax' will use stats::varimax while using rotate='Varimax' will call GPArotation::Varimax. I can confirm if I use the following code will produce the same results as python:

Basal = c(2.068,	2.068,	2.09,	2.097,	2.117,	2.14,	2.045,	2.076,	2.09,	2.111,	2.093,	2.1,	2.104)
Occ = c(2.07,	2.074,	2.09,	2.093,	2.125,	2.146,	2.054,	2.088,	2.093,	2.114,	2.098,	2.106,	2.101)
Max = c(1.58,	1.602,	1.613,	1.613,	1.663,	1.681,	1.58,	1.602,	1.643,	1.643,	1.653,	1.623,	1.653)
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
Y <- cbind(Basal, Occ, Max, id)
a <- fa(Y, nfactors=2, fm="pa", rotate="Varimax", SMC=TRUE, min.err=1e-10)

Looks like a lot of rotation in R fa does not use GPArotation library. I think we are fine here since this is exploratory analysis and different way to rotate the loadings may produce different results.

@yl565
Copy link
Contributor Author

yl565 commented Nov 27, 2016

@mvds314 Looks like the factor_rotation library is failing the unittest under python 2.7 Can you please help take a look?

@mvds314
Copy link

mvds314 commented Nov 28, 2016

@yl565 I took a look at the failed tests. I must say I don't know much about version control, automatic testing and stuff.

The unittests of the github code are running fine on my computer (both on Python 2.7 and 3.5).

I cannot find any tests failing here due to the factor rotation code:
https://travis-ci.org/statsmodels/statsmodels/builds/179298396
Could you be more specific on what test is failing?

@josef-pkt josef-pkt added this to needs_review in try_josef_queue Feb 26, 2017
Copy link
Member

@josef-pkt josef-pkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review (because I was in the multivariate neighborhood)

comments are under the assumption that we want to make this similar to Model/Results pattern for FactorAnalysis and less like the decomposition pattern in PCA and CCA.

self.rotation = rotation
return FactorResults(self)

def plot_scree(self, ncomp=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a results method

'equamax', 'oblimin', 'parsimax', 'parsimony',
'biquartimin', 'promax']:
raise ValueError('Unknown rotation method %s' % (rotation))
R = pd.DataFrame(self.endog).corr().values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use numpy.corrcoef(self.endog, rowvar=0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as part of data, this could be an attribute.
It doesn't change in repeated calls to fit

# communality
for j in range(len(R)):
R[j, j] = c[j]
L, V = eig(R)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use eigh ?

self.communality = c
# Perform rotation of the loadings
self.loadings_no_rot = np.array(A)
if rotation in ['varimax', 'quartimax', 'biquartimax', 'equamax',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible structural change: given that we can use different rotation with same raw FA
use pattern as for cov_type

  • make rotation a fit keyword
  • set the rotation in the results class __init__
  • try to avoid any rotation specific attributes in Factor and attach those to results instance

to get a new results instance we could make the optimization loop conditional on not having loadings_not_rot as Factor attribute.

maybe rename loadings_not_rot -> loadings_raw (or just loadings if it's unambiguous in the Factor class. (while FactorResults has the rotated loading).

"""
return plot_scree(self.eigenvals, self.n_comp, ncomp)

def plot_loadings(self, loading_pairs=None, plot_prerotated=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should also be a results method because it requires that fit has been called

@@ -0,0 +1,141 @@
import matplotlib.pyplot as plt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matplotlib import needs to be try ... except protected because it is not a required dependency

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still todo, causes import error on Travis with python 2 without matplotlib

@josef-pkt
Copy link
Member

The SAS link mentions SMC under priors, R sounds like it's just starting parameters.

question: does SMC have an effect on the final estimate or is it just to get the iterations started?

I had thought it applies FA to a differently scaled correlation matrix, but if it is only for starting values, then it might still belong into fit and not into __init__.

@yl565
Copy link
Contributor Author

yl565 commented Dec 10, 2017

It specifies the initial guess for communality and it has an effect on the final results.

Copy link
Member

@josef-pkt josef-pkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a few comments to the latest changes.

(I think we are getting close)


"""
def __init__(self, endog, n_factor, corr=None, method='pa', smc=True,
missing='drop', endog_names=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add also nobs as Kerby pointed out.
It's not currently used but will be necessary in future when we get some inferential methods.

raise ValueError('The number of columns in endog must be '
'equal to the number of columns and rows corr')
self.corr = corr
self.endog_names = endog_names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corr could be a pandas DataFrame
We need self.corr = np.asarray(corr) to make sure that we have an array for further processing.
endog_names could be taken from the index or column of the DataFrame,
e.g. if hasattr(corr, index): endog_names=...

if self.endog is not None:
return self.data.ynames
else:
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as default if not available you could make up some names
(that's what we usually do by default in the models)

raise ValueError('The number of elements in endog_names must '
'be equal to the number of columns and rows in corr')
if self.endog is not None and len(value) != self.endog.shape[1]:
raise ValueError('The number of elements in endog_names must '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could define k_endog in __init__ to store corr.shape[0] or endog.shape[1]
The the check would be simpler here


def fit(self, n_max_iter=50, tolerance=1e-6):
"""
Extract factors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fit is the public function and needs the full docstring

maxiter and tol need renaming

@coveralls
Copy link

coveralls commented Dec 11, 2017

Coverage Status

Coverage increased (+0.04%) to 81.956% when pulling 626a6b0 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

@yl565
Copy link
Contributor Author

yl565 commented Dec 11, 2017

Changes applied

@coveralls
Copy link

Coverage Status

Coverage increased (+0.04%) to 81.955% when pulling 4fa1309 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.04%) to 81.955% when pulling 4fa1309 on yl565:factoranalysis into 86e3d4b on statsmodels:master.


"""
def __init__(self, endog, n_factor, corr=None, method='pa', smc=True,
missing='drop', endog_names=None):
# CHeck validity of n_factor
missing='drop', endog_names=None, n_obs=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming is nobs, which we use in all model

@coveralls
Copy link

coveralls commented Dec 11, 2017

Coverage Status

Coverage increased (+0.04%) to 81.955% when pulling 4fa1309 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

@josef-pkt
Copy link
Member

comment to a commit (which might not be very visible):
naming is nobs, which we use in all model
i.e. n_obs -> nobs

@coveralls
Copy link

Coverage Status

Coverage increased (+0.04%) to 81.958% when pulling 47ef9d2 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.04%) to 81.958% when pulling 47ef9d2 on yl565:factoranalysis into 86e3d4b on statsmodels:master.

@josef-pkt
Copy link
Member

@yl565 Can you rebase this on master and push to a NEW PR?
Using a new PR avoids that we mess up the history of this PR, I don't think we have to squash commits.
(If you have problems, then I can do the rebase. )

I think we need to add "status: experimental" to the docstrings because I'm pretty sure we have to change things as we add more features.
(I'm still not familiar enough we FA to tell what we will need. The latest I tried, is to get the factors and prediction on endog based on factors, i.e. observation specific information. However, it's not as easy as in PCA. Additionally, we need the reconstituted cov_endog from the factor model. )

@yl565
Copy link
Contributor Author

yl565 commented Dec 12, 2017

OK, I will create a new branch, rebase on the new branch and push it to a new PR. Is it right?

@josef-pkt
Copy link
Member

more precise: rebase the new branch on master
Otherwise, yes that's it.

I think I will merge after another spot checking.
But there is still work to do after the merge, e.g. AFAICS, smc=False still doesn't have unit tests.

def test_example_compare_to_R_output():
# No rotation without squared multiple correlations prior
# produce same results as in R `fa`
mod = Factor(X.iloc[:, 1:-1], 2, smc=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the unit test is here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I need better reading glasses.
Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, just let me know if there is anything else need to be done

@josef-pkt
Copy link
Member

If you have time to work on this some more, then there are a few features to get closer to what e.g. Stata has, e.g. methods for factor scoring.

I will look at the new PR and most likely will merge it tomorrow.

What we still need is a paragraph, or so, with a brief description for each of the new multivariate features that you added for inclusion in the release notes.

I will check inclusion in api.py and in rst documentation

@coveralls
Copy link

coveralls commented Dec 12, 2017

Coverage Status

Coverage increased (+0.04%) to 81.958% when pulling 653a43f on yl565:factoranalysis into 86e3d4b on statsmodels:master.

@josef-pkt
Copy link
Member

rebased version #4161 merged

@josef-pkt
Copy link
Member

I opened #4164 to keep track of enhancement and refactoring ideas.

@josef-pkt josef-pkt moved this from needs_review to done in try_josef_queue Apr 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet

6 participants