
WIP ENH: cov and matrix tools to work with covariances and correlations #4143

Open · wants to merge 2 commits into main from josef-pkt:cov_matrix_tools
Conversation

@josef-pkt (Member) commented Nov 27, 2017

A semi-random collection of matrix functions and estimators for the covariance of sample correlation and covariance matrices.

Very unfinished. I haven't even tried out all functions yet, and the API needs a lot of work. This is partially an "academic" exercise in understanding how to work with and compute the asymptotic covariance of covariance and correlation matrices.

In contrast to #2765, I didn't find much in multivariate textbooks, and nothing in Stata. The psych package in R has many correlation functions and hypothesis tests. This PR is written based on a collection of articles going back to 1970 (references below).

I'm not sure yet how and where this fits in; currently everything is in the private module cov_tools. Overall this goes in the direction of modeling second moments, while so far statsmodels is mostly focused on first moments.

  • include matrix tools (ENH: matrix tools for covariances #4140). The names are neither informative nor pep-8 compliant; they were written based on the first article I looked at when starting to write code, and Neudecker changes the capital letter for various matrices across articles.
  • cov_cov and cov_corr: asymptotic covariance of a covariance and of a correlation estimator, targeting various patterns/structures or linear restrictions on those. The main assumptions are one of normal, elliptical, and distribution free, which in this case affects the assumptions on the 4th moments (a minimal normal-theory sketch follows this list).
  • Eventually the target is structured covariance and correlation estimators with MLE, iterated GLS, or GMM. They are all similar but use different weighting matrices and rely on different underlying assumptions.
  • hypothesis tests on correlation coefficients: once we have cov_corr, the rest is relatively standard.
  • hypothesis tests on patterns of cov and corr: most hypothesis tests in SUMM/ENH: hypothesis tests for covariance/correlation matrices #2765 are based on likelihood ratio tests. The literature underlying this PR goes more in the direction of score/LM and Wald tests.
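For orientation, the best-known special case of cov_corr is the normal-theory asymptotic covariance between two correlation coefficients (the Pearson-Filon formula, cf. Steiger 1980a). A minimal sketch under that normality assumption; the function name here is hypothetical and not this PR's API:

import numpy as np

def cov_corr_coef_normal(i, j, k, h, corr):
    """nobs times the asymptotic Cov(r_ij, r_kh) under normality."""
    r = corr
    return (0.5 * r[i, j] * r[k, h]
            * (r[i, k]**2 + r[i, h]**2 + r[j, k]**2 + r[j, h]**2)
            + r[i, k] * r[j, h] + r[i, h] * r[j, k]
            - r[i, j] * (r[j, k] * r[j, h] + r[i, k] * r[i, h])
            - r[k, h] * (r[j, k] * r[i, k] + r[j, h] * r[i, h]))

# sanity check: the diagonal case reduces to the familiar (1 - rho**2)**2
corr = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])
assert np.isclose(cov_corr_coef_normal(0, 1, 0, 1, corr), (1 - 0.5**2)**2)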

Related and not in here: GMM/unrestricted 4th moments can have bad small-sample properties. There are some articles that use an F distribution instead of the usual chi-square distribution to improve this.

Related, but nothing available yet: kurtosis measures to get a better (approximation to the) asymptotic distribution of test statistics in the elliptically symmetric case, e.g. #3280 (there are many articles on this for multivariate methods). I don't know whether we need multivariate Mardia kurtosis or some average univariate measure.
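For reference, a hedged sketch of Mardia's multivariate kurtosis, one candidate measure (the helper name is hypothetical and not in statsmodels): under normality its expectation is p*(p+2), and a common excess-kurtosis parameter for elliptical corrections is kappa = b2p / (p*(p+2)) - 1.

import numpy as np

def mardia_kurtosis(x):
    """Mardia's b_{2,p}: mean of squared squared-Mahalanobis distances (ML covariance)."""
    x = np.asarray(x, float)
    n, p = x.shape
    xc = x - x.mean(0)
    s_inv = np.linalg.inv(xc.T @ xc / n)
    d2 = np.einsum('ij,jk,ik->i', xc, s_inv, xc)   # squared Mahalanobis distances
    return (d2**2).mean()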

List of references (incomplete, but most of what I looked at):

Magnus, Jan R., and H. Neudecker. 1986. “Symmetry, 0-1 Matrices and Jacobians: A Review.” Econometric Theory 2 (2):157–90.
Neudecker, H., and A. M. Wesselman. 1990. “The Asymptotic Variance Matrix of the Sample Correlation Matrix.” Linear Algebra and Its Applications 127 (January):589–99. https://doi.org/10.1016/0024-3795(90)90363-H.
Neudecker, Heinz. 1996. “The Asymptotic Variance Matrices of the Sample Correlation Matrix in Elliptical and Normal Situations and Their Proportionality.” Linear Algebra and Its Applications, Linear Algebra and Statistics: In Celebration of C. R. Rao’s 75th Birthday (September 10, 1995), 237 (April):127–32. https://doi.org/10.1016/0024-3795(95)00351-7.
Satorra, Albert, and Heinz Neudecker. 1997. “Compact Matrix Expressions for Generalized Wald Tests of Equality of Moment Vectors.” Journal of Multivariate Analysis 63 (2):259–76. https://doi.org/10.1006/jmva.1997.1696.
Steiger, James H., and A. Ralph Hakstian. 1982. “The Asymptotic Distribution of Elements of a Correlation Matrix: Theory and Application.” British Journal of Mathematical and Statistical Psychology 35 (2):208–15. https://doi.org/10.1111/j.2044-8317.1982.tb00653.x.

Browne, M. W. 1974. “Generalized Least Squares Estimators in the Analysis of Covariance Structures.” South African Statistical Journal 8 (1):1–24.
———. 1977. “The Analysis of Patterned Correlation Matrices by Generalized Least Squares.” British Journal of Mathematical and Statistical Psychology 30 (1):113–24. https://doi.org/10.1111/j.2044-8317.1977.tb00730.x.
———. 1984. “Asymptotically Distribution-Free Methods for the Analysis of Covariance Structures.” British Journal of Mathematical and Statistical Psychology 37 (1):62–83. https://doi.org/10.1111/j.2044-8317.1984.tb00789.x.
Huang, Yafei, and Peter M. Bentler. 2015. “Behavior of Asymptotically Distribution Free Test Statistics in Covariance Versus Correlation Structure Analysis.” Structural Equation Modeling: A Multidisciplinary Journal 22 (4):489–503. https://doi.org/10.1080/10705511.2014.954078.
Steiger, James H. 1980a. “Tests for Comparing Elements of a Correlation Matrix.” Psychological Bulletin 87 (2):245–51. https://doi.org/10.1037/0033-2909.87.2.245.
———. 1980b. “Testing Pattern Hypotheses On Correlation Matrices: Alternative Statistics And Some Empirical Results.” Multivariate Behavioral Research 15 (3):335–52. https://doi.org/10.1207/s15327906mbr1503_7.

Yuan, K.-H., and P. M. Bentler. 1999. “F Tests for Mean and Covariance Structure Analysis.” Journal of Educational and Behavioral Statistics 24 (3):225–43. https://doi.org/10.3102/10769986024003225.

Hayakawa, Kazuhiko. 2016. “On the Effect of Weighting Matrix in GMM Specification Test.” Journal of Statistical Planning and Inference 178 (November):84–98. https://doi.org/10.1016/j.jspi.2016.06.003.
Prokhorov, Artem. 2009. “On Relative Efficiency of Quasi-MLE and GMM Estimators of Covariance Structure Models.” Economics Letters 102 (1):4–6. https://doi.org/10.1016/j.econlet.2008.08.019.

Edit: I found a LISREL book that has a summary collection of formulas in its appendices: section 16.1.5 "General Covariance Structures" (appendix F), and multivariate Mardia skew and kurtosis in chapter 12, appendix B.
Jöreskog, Karl G., Ulf H. Olsson, and Fan Y. Wallentin. 2016. Multivariate Analysis with LISREL. Springer Series in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-33153-9.

@coveralls commented Nov 27, 2017

Coverage Status

Coverage decreased (-0.2%) to 81.72% when pulling c90fa70 on josef-pkt:cov_matrix_tools into 37f7e05 on statsmodels:master.

@codecov-io commented Nov 27, 2017

Codecov Report

Merging #4143 into master will decrease coverage by 0.01%.
The diff coverage is 75.58%.


@@            Coverage Diff            @@
##           master   #4143      +/-   ##
=========================================
- Coverage   79.31%   79.3%   -0.02%     
=========================================
  Files         546     548       +2     
  Lines       81806   82105     +299     
  Branches     9268    9288      +20     
=========================================
+ Hits        64887   65113     +226     
- Misses      14824   14890      +66     
- Partials     2095    2102       +7
Impacted Files Coverage Δ
statsmodels/stats/tests/test_cov_tools.py 100% <100%> (ø)
statsmodels/stats/_cov_tools.py 57.8% <57.8%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@josef-pkt (Member Author) commented:

An idea for how to organize the matrix tools and reuse the pieces is to create a SymmTools class,

e.g.
indexing and labels for the operators and the resulting vectors or arrays. It is difficult to keep track of which element refers to which product in 2-D and 4-D partially raveled and selected arrays. A pandas DataFrame would help for results.

e.g. using indexing and masks instead of linalg operators plus labels
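A very rough, hypothetical sketch of what such a SymmTools bundle could look like (names invented here, not part of this PR); the detailed index/mask exploration follows below:

import numpy as np
import pandas as pd

class SymmTools:
    """Bundle indices, masks and labels for vec/vech/veclow of k x k symmetric matrices."""

    def __init__(self, k, order='F'):
        self.k = k
        # (i, j) index pairs in vec (column-major) order
        self.mg2 = np.mgrid[:k, :k].reshape((2, -1), order=order).T
        self.mask_vech = (np.diff(self.mg2, 1) <= 0).squeeze()  # lower incl. diagonal
        self.mask_low = (np.diff(self.mg2, 1) < 0).squeeze()    # strictly lower
        self.labels = list(map(tuple, self.mg2))

    def frame_low(self, cov_vec):
        """Label the strictly-lower block of a vec-indexed covariance as a DataFrame."""
        labels = [lab for lab, m in zip(self.labels, self.mask_low) if m]
        block = cov_vec[np.ix_(self.mask_low, self.mask_low)]
        return pd.DataFrame(block, index=labels, columns=labels)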

# working with indices and masks, e.g. select the strictly lower triangle from vec
import numpy as np
import pandas as pd
from statsmodels.stats import _cov_tools as ct  # private module added in this PR
# corr_r, x, diff, g1, g2, g3 and the helpers Dup, vech, veclow, ztest_correlated
# are assumed to be defined earlier in the session (test-case objects and PR functions)

kv = 3
order = 'F'
ravel_idx_mom4 = np.mgrid[:kv, :kv, :kv, :kv].reshape((4, -1), order=order).T
mg2 = np.mgrid[:kv, :kv].reshape((2, -1), order=order).T
mask_low = (np.diff(mg2, 1) < 0).squeeze()
idx2_dict = dict((tuple(rii.tolist()), ii) for ii, rii in enumerate(mg2))

# wrong nobs in the computation needs compensation
covcorr = ct._cov_corr(corr_r, ct._cov_cov(corr_r, 20) * 20, 20) * 20

# extract an individual covariance of correlation coefficients
covcorr[idx2_dict[(1, 2)], idx2_dict[(1, 2)]]

# add an index or labels to a DataFrame
mdlabels = list(map(tuple, mg2[mask_low]))  # convert to list of tuples
pd.DataFrame(covcorr[mask_low][:, mask_low], index=mdlabels, columns=mdlabels)
"""
          (0, 1)    (0, 2)    (1, 2)
(0, 1)  0.170123  0.138327  0.028239
(0, 2)  0.138327  0.170123  0.028239
(1, 2)  0.028239  0.028239  0.024656
"""

cre = ct._cov_corr_elliptical_vech(corr_r, 1, kurt=0)
dd = Dup(3)               # duplication matrix for 3 x 3 symmetric matrices
di = np.linalg.pinv(dd)
#(di.dot(cre).dot(di.T))[np.array([[1,2,4]]).T, np.array([[1,2,4]])] / covcorr[mask_low][:, mask_low]
print(cre[np.array([[1, 2, 4]]).T, np.array([[1, 2, 4]])] / covcorr[mask_low][:, mask_low])
# Note: create index arrays

# selector masks for vec, vech and veclow
mask_vech = (np.diff(mg2, 1) <= 0).squeeze()
mg2vech = mg2[mask_vech]
mask_vech2veclow = (np.diff(mg2vech, 1) < 0).squeeze()
print(cre[np.ix_(mask_vech2veclow, mask_vech2veclow)])
cr2 = di.dot(ct._cov_corr(corr_r, ct._cov_cov(corr_r, 1), 1)).dot(di.T)[np.ix_(mask_vech2veclow, mask_vech2veclow)]
# same as
# Note: we need to set nobs=1 in the call to _cov_cov.  Bug or definition?
ct._cov_corr(corr_r, ct._cov_cov(corr_r, 1), 1)[np.ix_(mask_low, mask_low)]

# variance of correlation coefficients under normality, excluding diagonal elements with var=0
((1 - vech(corr_r)**2)**2)[mask_vech2veclow]
((1 - veclow(corr_r)**2)**2)

var0, var1, covar, nobs = g1, g2, g3, 20
ztest_correlated(diff, var0, var1, covar, nobs)
# (-2.3642128782748038, 0.018068426888646225)
zt1 = -2.30
@josef-pkt (Member Author) commented on the diff:
The ztest is an application in Steiger and Hakstian. Not converted to unit tests yet, and no other support for hypothesis tests yet.

assert_equal(cr1.shape, (n_low, n_low))
assert_allclose(cr2, cr1, rtol=1e-14)
assert_allclose(cr3, cr1, rtol=1e-14)
assert_allclose(cr4, cr1, rtol=1e-14)
@josef-pkt (Member Author) commented on the diff:
There are two underlying cov_corr functions; the other variation is in the selection and the selection options to get to the minimal, unique correlations.
Missing: comparison to the Steiger-Hakstian example, a single-element function and reference numbers. Only tested for the normal case; elliptical with small kurtosis should be close. _cov_corr works for an arbitrary cov_cov, but the scaling of mom4 (1/nobs) is not checked for consistency.



def cov_cov_fisherz(corr, cov_cov):
    """covariance of Fisher's Z-transformed correlation matrix
@josef-pkt (Member Author) commented on the diff:
Not used yet; will be needed for the Fisher-transformed versions of the hypothesis tests.
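For reference, the delta-method form this would take is straightforward; a hedged sketch with a hypothetical name, using d arctanh(r)/dr = 1/(1 - r**2):

import numpy as np

def _cov_fisherz_sketch(corr_flat, cov_corr):
    """Delta method: Cov of arctanh(r) given Cov of the raveled correlations."""
    # intended for off-diagonal correlations only (diagonal r=1 would divide by zero)
    jac = np.diag(1.0 / (1.0 - np.asarray(corr_flat)**2))
    return jac @ cov_corr @ jac.T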

return m4 / (nobs - ddof)


def _mom4_normal(i, j, k, h, cov):
@josef-pkt (Member Author) commented on the diff:
Useful mainly for cross-checking; in most cases we want the Kronecker product version.
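The Kronecker-product version for the normal case is the standard identity E[xx' kron xx'] = (I + K)(Sigma kron Sigma) + vec(Sigma) vec(Sigma)', with K the commutation matrix. A hedged sketch (helper names are hypothetical, not the PR's functions):

import numpy as np

def commutation_matrix(p):
    """K_p with K_p @ vec(A) = vec(A.T) for p x p matrices."""
    idx = np.arange(p * p).reshape(p, p).ravel(order='F')
    return np.eye(p * p)[idx]

def mom4_normal_kron(sigma):
    """4th-moment matrix E[xx' kron xx'] for x ~ N(0, sigma)."""
    p = sigma.shape[0]
    vs = sigma.ravel(order='F')[:, None]          # vec(sigma) as a column
    return ((np.eye(p * p) + commutation_matrix(p)) @ np.kron(sigma, sigma)
            + vs @ vs.T)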

return ii.ravel(order='C'), jj.ravel(order='C')


def _cov_cov(cov, nobs, assume='elliptical', kurt=0, data=None):
@josef-pkt (Member Author) commented on the diff:
Replace the data argument by mom4.

@josef-pkt (Member Author) commented Nov 29, 2017

(All red: I fixed a typo in the unit tests where the wrong matrix was used, then amended and force-pushed.)

This now finishes the basic tools part; unit tests are still incomplete.

The next parts will be two applications and one additional tools refactoring or enhancement (but not now):

  • hypothesis tests, especially for linear restrictions on the correlation coefficients; two articles by Steiger, and we can now avoid his quadruple indices.
  • GLS/GMM estimation, for which we now have the cov/weight matrices (articles by Browne).
  • computational efficiency: I expect that either sparse matrix tools or replacing them by indexing/masks will be a more efficient solution. Additionally, in cases like elliptical distributions, including normal, there are special-case formulas that avoid nobs**2 by nobs**2 arrays (like mom4).

Some defaults or options, like whether to divide by nobs or not, can wait until we have applications and see what's more convenient.

@coveralls commented:
Coverage Status

Coverage decreased (-0.01%) to 81.864% when pulling ef29651 on josef-pkt:cov_matrix_tools into 37f7e05 on statsmodels:master.


@josef-pkt (Member Author) commented:

One piece that I haven't figured out yet: what is the impact of using standardized variables to compute cov_corr? Neudecker uses the cov as the starting point in the general case; _cov_corr needs the raw data or the mom4 of the raw data in the non-elliptical case. However, Steiger and Browne use the standardized mom4 in some places.

It looks like it might be possible to rewrite Neudecker and Wesselman to use only corr and a cov_cov based on the standardized mom4.

The example illustrates the scale invariance of cov_corr:

with raw data x and cov

>>> ct._cov_corr(m2, ct._cov_cov_mom4(m2, ct.mom4(x - x.mean(0), ddof=0), 1), 1)[np.ix_(mask_low, mask_low)]
array([[ 0.20742761,  0.06664495,  0.08843487],
       [ 0.06664495,  0.04776686,  0.03091085],
       [ 0.08843487,  0.03091085,  0.04860455]])

with standardized data z and corr

>>> ct._cov_corr(corr, ct._cov_cov_mom4(corr, ct.mom4(z, ddof=0), 1), 1)[np.ix_(mask_low, mask_low)]

array([[ 0.20742761,  0.06664495,  0.08843487],
       [ 0.06664495,  0.04776686,  0.03091085],
       [ 0.08843487,  0.03091085,  0.04860455]])

How to replicate cov_corr at a constrained estimate corr_r (from the test case): compute m2r, the cov under the correlation restriction; mom4 remains unrestricted.

>>> std = np.sqrt(np.diag(m2))
>>> m2r = corr_r*std[:,None]*std
>>> ct._cov_corr(m2r, ct._cov_cov_mom4(m2r, ct.mom4(x - x.mean(0), ddof=1), 1), 1)[np.ix_(mask_low, mask_low)]
array([[ 0.20200122,  0.04860593,  0.0775348 ],
       [ 0.04860593,  0.10611046,  0.04848629],
       [ 0.0775348 ,  0.04848629,  0.05116268]])

>>> i, j, k, h = 0, 1, 0, 1
>>> g1 = cov_corr_coef(i, j, k, h, corr_r, m4)
>>> g1
0.20200121658014503

That means we can (not so *) easily compute quadratic forms for goodness of fit or tests of constrained correlations, e.g. for

>>> diff = veclow(corr) - veclow(corr_r) 
>>> diff
array([-0.06665687,  0.06665687,  0.        ])

*) I don't manage it right now; see e.g. Browne 1984, eq. (2.20b), Proposition 4.

This approximately reproduces the Steiger and Hakstian example, continuing from above. The test value I got is -1.2978383834464489; SH have zt1 = -1.25.

>>> crr = ct._cov_corr(m2r, ct._cov_cov_mom4(m2r, ct.mom4(x - x.mean(0), ddof=1), 1), 1)[np.ix_(mask_low, mask_low)]
>>> cons = np.array([-1,  1,  0. ])
>>> (corr[0, 1] - corr[0, 2]) / np.sqrt(cons.dot(crr/20).dot(cons))
-1.2982304211438958
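Mechanically, the Wald-type quadratic form alluded to above looks like the following sketch (continuing with crr, corr and veclow from this session, nobs = 20; this is only the generic form, not a verified match to Browne 1984, and for a single contrast it is just the square of the z statistic above):

import numpy as np
from scipy import stats

R = np.array([[-1.0, 1.0, 0.0]])              # single contrast on veclow(corr), q = 0
resid = R.dot(veclow(corr))
vr = R.dot(crr / 20).dot(R.T)                 # nobs = 20 in this example
wald = float(resid.dot(np.linalg.solve(vr, resid)))
pvalue = stats.chi2.sf(wald, df=R.shape[0])   # df = number of restrictions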

@josef-pkt (Member Author) commented Nov 29, 2017

Just saw this (after a question about unfinished SVAR parts):

    def _compute_J(self, A_solve, B_solve):

        #first compute appropriate duplication matrix
        # taken from Magnus and Neudecker (1980),
        #"The Elimination Matrix: Some Lemmas and Applications
        # the creation of the D_n matrix follows MN (1980) directly,
        #while the rest follows Hamilton (1994)

It doesn't say what J is, but SVAR computes a cov_resid with constrained zeros under a normal distribution assumption (MLE).
I never went through those SVAR details, but it's similar to estimating a linear model with a restricted cov structure by MLE. We should get some overlapping code when the extensions here are finished.

Edit: J is only used for checking whether there are enough identifying restrictions on A and B.
The MLE is for a static problem after the lag effects have been removed by OLS (equivalent to SUR): A y_t = B u_t, where u_t are the residuals from the unrestricted VAR/OLS regression.
The returned params for the lag coefficients, var_params, are from the unrestricted VAR regression.

@josef-pkt (Member Author) commented:

Notation: Bai and Li, referenced in #4156, use veck for the lower-triangular vec without the diagonal. It's the only place where I have seen a shorthand name for it.
They also have an index construction of the vech duplication matrix.
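As a hedged illustration of what such an index construction can look like (not their code; the name is hypothetical), the vech duplication matrix can be built by mapping each vec position (i, j) to the vech position of (max(i, j), min(i, j)):

import numpy as np

def dup_matrix_from_index(k):
    """Duplication matrix D with D @ vech(A) = vec(A) for k x k symmetric A."""
    vech_pos = {(i, j): c for c, (i, j) in
                enumerate((i, j) for j in range(k) for i in range(j, k))}
    d = np.zeros((k * k, k * (k + 1) // 2))
    for j in range(k):                 # vec is column-major: row = j * k + i
        for i in range(k):
            d[j * k + i, vech_pos[(max(i, j), min(i, j))]] = 1.0
    return d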

@jbrockmendel (Contributor) commented:

Not sure if this is the right place, but a suggestion/request: add checks/assertions that covariance matrices are positive semidefinite. A failure mode I see occasionally involves this failing due to floating point error (often, but not always, accompanied by convergence failure).

@josef-pkt (Member Author) commented:

About positive definiteness:
We have some helper functions to ensure or convert to positive (semi-)definite matrices in stats.correlation_tools.

What's the context?
If it is for optimization in the models, then it is a different question. Kevin added a positive definiteness check instead of a non-invertibility/matrix_rank check for the final cov_params.
In my experience (in the old days with GAUSS), doing an eigenvalue decomposition during optimization is too expensive; lbfgs or bfgs should work better than plain Newton.
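For example, a small usage sketch of those existing helpers (the example matrix is just an illustration; defaults may differ across versions):

import numpy as np
from statsmodels.stats.correlation_tools import corr_nearest, cov_nearest

# a correlation-like matrix that is slightly indefinite
corr = np.array([[1.0, 0.95, 0.10],
                 [0.95, 1.0, 0.95],
                 [0.10, 0.95, 1.0]])
np.linalg.eigvalsh(corr)             # smallest eigenvalue is negative

corr_psd = corr_nearest(corr, threshold=1e-15)
cov_psd = cov_nearest(corr * 4.0, method='clipped')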

@josef-pkt (Member Author) commented:

It would be good to get this merged as private, experimental tools. Then we can gradually write functions on top of it, e.g. hypothesis tests for individual correlation coefficients, or element-wise hypothesis tests for covariance/correlation matrices.

Also, this is one of the main PRs that uses a kurtosis correction, i.e. extending cov/corr inference to elliptical distributions.

Labels: comp-multivariate, comp-stats, comp-tools, superseded (content superseded by a newer PR), type-enh
Projects: topic-covariance