
WIP ENH: cov and matrix tools to work with covariances and correlations #4143

Open · wants to merge 2 commits into main from josef-pkt:cov_matrix_tools
Conversation

@josef-pkt (Member) commented Nov 27, 2017

A semi-random collection of matrix functions and estimators for the covariance of sample correlation and covariance matrices.

Very unfinished. I haven't even tried out all functions yet, and the API needs a lot of work. This is partially an "academic" exercise in understanding how to work with and compute the asymptotic covariance of covariance and correlation matrices.

In contrast to #2765, I didn't find much in multivariate textbooks, and nothing in Stata. The psych package in R has many correlation functions and hypothesis tests. This PR is written based on a collection of articles going back to 1970 (references below).

I'm not sure yet how and where this fits in; currently everything is in the private module cov_tools. Overall this goes in the direction of modeling second moments, while so far statsmodels is mostly focused on first moments.

  • include matrix tools (ENH: matrix tools for covariances #4140). The names are neither informative nor pep-8 compliant; they were written based on the first article I looked at when starting to write code, and Neudecker changes the capital letter for various matrices across articles.
  • cov_cov and cov_corr: asymptotic covariance of a covariance and of a correlation estimator, targeting various patterns/structures or linear restrictions on those. The main assumptions are one of normal, elliptical, and distribution free, which in this case affects the assumptions on the 4th moments (a minimal normal-theory sketch follows this list).
  • Eventually the target is structured covariance and correlation estimators with MLE, iterated GLS, or GMM. They are all similar but use different weighting matrices and rely on different underlying assumptions.
  • hypothesis tests on correlation coefficients: once we have cov_corr, the rest is relatively standard.
  • hypothesis tests on patterns of cov and corr: most hypothesis tests in SUMM/ENH: hypothesis tests for covariance/correlation matrices #2765 are based on likelihood ratio tests. The literature underlying this PR goes more in the direction of score/LM and Wald tests.
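For orientation, the best-known special case of cov_corr is the normal-theory asymptotic covariance between two correlation coefficients (the Pearson-Filon formula, cf. Steiger 1980a). A minimal sketch under that normality assumption; the function name here is hypothetical and not this PR's API:

import numpy as np

def cov_corr_coef_normal(i, j, k, h, corr):
    """nobs times the asymptotic Cov(r_ij, r_kh) under normality."""
    r = corr
    return (0.5 * r[i, j] * r[k, h]
            * (r[i, k]**2 + r[i, h]**2 + r[j, k]**2 + r[j, h]**2)
            + r[i, k] * r[j, h] + r[i, h] * r[j, k]
            - r[i, j] * (r[j, k] * r[j, h] + r[i, k] * r[i, h])
            - r[k, h] * (r[j, k] * r[i, k] + r[j, h] * r[i, h]))

# sanity check: the diagonal case reduces to the familiar (1 - rho**2)**2
corr = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])
assert np.isclose(cov_corr_coef_normal(0, 1, 0, 1, corr), (1 - 0.5**2)**2)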

Related and not in here: GMM/unrestricted 4th moments can have bad small-sample properties. There are some articles that use an F distribution instead of the usual chi-square distribution to improve this.

Related, but nothing available yet: kurtosis measures to get a better (approximation to the) asymptotic distribution of test statistics in the elliptically symmetric case, e.g. #3280 (there are many articles on this for multivariate methods). I don't know whether we need multivariate Mardia kurtosis or some average univariate measure.
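For reference, a hedged sketch of Mardia's multivariate kurtosis, one candidate measure (the helper name is hypothetical and not in statsmodels): under normality its expectation is p*(p+2), and a common excess-kurtosis parameter for elliptical corrections is kappa = b2p / (p*(p+2)) - 1.

import numpy as np

def mardia_kurtosis(x):
    """Mardia's b_{2,p}: mean of squared squared-Mahalanobis distances (ML covariance)."""
    x = np.asarray(x, float)
    n, p = x.shape
    xc = x - x.mean(0)
    s_inv = np.linalg.inv(xc.T @ xc / n)
    d2 = np.einsum('ij,jk,ik->i', xc, s_inv, xc)   # squared Mahalanobis distances
    return (d2**2).mean()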

List of references (incomplete, but most of what I looked at):

Magnus, Jan R., and H. Neudecker. 1986. “Symmetry, 0-1 Matrices and Jacobians: A Review.” Econometric Theory 2 (2):157–90.
Neudecker, H., and A. M. Wesselman. 1990. “The Asymptotic Variance Matrix of the Sample Correlation Matrix.” Linear Algebra and Its Applications 127 (January):589–99. https://doi.org/10.1016/0024-3795(90)90363-H.
Neudecker, Heinz. 1996. “The Asymptotic Variance Matrices of the Sample Correlation Matrix in Elliptical and Normal Situations and Their Proportionality.” Linear Algebra and Its Applications, Linear Algebra and Statistics: In Celebration of C. R. Rao’s 75th Birthday (September 10, 1995), 237 (April):127–32. https://doi.org/10.1016/0024-3795(95)00351-7.
Satorra, Albert, and Heinz Neudecker. 1997. “Compact Matrix Expressions for Generalized Wald Tests of Equality of Moment Vectors.” Journal of Multivariate Analysis 63 (2):259–76. https://doi.org/10.1006/jmva.1997.1696.
Steiger, James H., and A. Ralph Hakstian. 1982. “The Asymptotic Distribution of Elements of a Correlation Matrix: Theory and Application.” British Journal of Mathematical and Statistical Psychology 35 (2):208–15. https://doi.org/10.1111/j.2044-8317.1982.tb00653.x.

Browne, M. W. 1974. “Generalized Least Squares Estimators in the Analysis of Covariance Structures.” South African Statistical Journal 8 (1):1–24.
———. 1977. “The Analysis of Patterned Correlation Matrices by Generalized Least Squares.” British Journal of Mathematical and Statistical Psychology 30 (1):113–24. https://doi.org/10.1111/j.2044-8317.1977.tb00730.x.
———. 1984. “Asymptotically Distribution-Free Methods for the Analysis of Covariance Structures.” British Journal of Mathematical and Statistical Psychology 37 (1):62–83. https://doi.org/10.1111/j.2044-8317.1984.tb00789.x.
Huang, Yafei, and Peter M. Bentler. 2015. “Behavior of Asymptotically Distribution Free Test Statistics in Covariance Versus Correlation Structure Analysis.” Structural Equation Modeling: A Multidisciplinary Journal 22 (4):489–503. https://doi.org/10.1080/10705511.2014.954078.
Steiger, James H. 1980a. “Tests for Comparing Elements of a Correlation Matrix.” Psychological Bulletin 87 (2):245–51. https://doi.org/10.1037/0033-2909.87.2.245.
———. 1980b. “Testing Pattern Hypotheses On Correlation Matrices: Alternative Statistics And Some Empirical Results.” Multivariate Behavioral Research 15 (3):335–52. https://doi.org/10.1207/s15327906mbr1503_7.

Yuan, K.-H., and P. M. Bentler. 1999. “F Tests for Mean and Covariance Structure Analysis.” Journal of Educational and Behavioral Statistics 24 (3):225–43. https://doi.org/10.3102/10769986024003225.

Hayakawa, Kazuhiko. 2016. “On the Effect of Weighting Matrix in GMM Specification Test.” Journal of Statistical Planning and Inference 178 (November):84–98. https://doi.org/10.1016/j.jspi.2016.06.003.
Prokhorov, Artem. 2009. “On Relative Efficiency of Quasi-MLE and GMM Estimators of Covariance Structure Models.” Economics Letters 102 (1):4–6. https://doi.org/10.1016/j.econlet.2008.08.019.

Edit: I found a LISREL book that has a summary collection of formulas in its appendices: section 16.1.5 "General Covariance Structures" (appendix F), and multivariate Mardia skew and kurtosis in chapter 12, appendix B.
Jöreskog, Karl G., Ulf H. Olsson, and Fan Y. Wallentin. 2016. Multivariate Analysis with LISREL. Springer Series in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-33153-9.

@coveralls commented Nov 27, 2017

Coverage Status

Coverage decreased (-0.2%) to 81.72% when pulling c90fa70 on josef-pkt:cov_matrix_tools into 37f7e05 on statsmodels:master.

@codecov-io commented Nov 27, 2017

Codecov Report

Merging #4143 into master will decrease coverage by 0.01%.
The diff coverage is 75.58%.


@@            Coverage Diff            @@
##           master   #4143      +/-   ##
=========================================
- Coverage   79.31%   79.3%   -0.02%     
=========================================
  Files         546     548       +2     
  Lines       81806   82105     +299     
  Branches     9268    9288      +20     
=========================================
+ Hits        64887   65113     +226     
- Misses      14824   14890      +66     
- Partials     2095    2102       +7
Impacted Files Coverage Δ
statsmodels/stats/tests/test_cov_tools.py 100% <100%> (ø)
statsmodels/stats/_cov_tools.py 57.8% <57.8%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@josef-pkt (Member Author) commented:

An idea for how to organize the matrix tools and reuse the pieces is to create a SymmTools class,

e.g.
indexing and labels for the operators and the resulting vectors or arrays. It is difficult to keep track of which element refers to which product in 2-D and 4-D partially raveled and selected arrays. A pandas DataFrame would help for results.

e.g. using indexing and masks instead of linalg operators plus labels
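A very rough, hypothetical sketch of what such a SymmTools bundle could look like (names invented here, not part of this PR); the detailed index/mask exploration follows below:

import numpy as np
import pandas as pd

class SymmTools:
    """Bundle indices, masks and labels for vec/vech/veclow of k x k symmetric matrices."""

    def __init__(self, k, order='F'):
        self.k = k
        # (i, j) index pairs in vec (column-major) order
        self.mg2 = np.mgrid[:k, :k].reshape((2, -1), order=order).T
        self.mask_vech = (np.diff(self.mg2, 1) <= 0).squeeze()  # lower incl. diagonal
        self.mask_low = (np.diff(self.mg2, 1) < 0).squeeze()    # strictly lower
        self.labels = list(map(tuple, self.mg2))

    def frame_low(self, cov_vec):
        """Label the strictly-lower block of a vec-indexed covariance as a DataFrame."""
        labels = [lab for lab, m in zip(self.labels, self.mask_low) if m]
        block = cov_vec[np.ix_(self.mask_low, self.mask_low)]
        return pd.DataFrame(block, index=labels, columns=labels)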

# working with indices and masks, e.g. select the strictly lower triangle from vec
import numpy as np
import pandas as pd
from statsmodels.stats import _cov_tools as ct  # private module added in this PR
# corr_r, x, diff, g1, g2, g3 and the helpers Dup, vech, veclow, ztest_correlated
# are assumed to be defined earlier in the session (test-case objects and PR functions)

kv = 3
order = 'F'
ravel_idx_mom4 = np.mgrid[:kv, :kv, :kv, :kv].reshape((4, -1), order=order).T
mg2 = np.mgrid[:kv, :kv].reshape((2, -1), order=order).T
mask_low = (np.diff(mg2, 1) < 0).squeeze()
idx2_dict = dict((tuple(rii.tolist()), ii) for ii, rii in enumerate(mg2))

# wrong nobs in the computation needs compensation
covcorr = ct._cov_corr(corr_r, ct._cov_cov(corr_r, 20) * 20, 20) * 20

# extract an individual covariance of correlation coefficients
covcorr[idx2_dict[(1, 2)], idx2_dict[(1, 2)]]

# add an index or labels to a DataFrame
mdlabels = list(map(tuple, mg2[mask_low]))  # convert to list of tuples
pd.DataFrame(covcorr[mask_low][:, mask_low], index=mdlabels, columns=mdlabels)
"""
          (0, 1)    (0, 2)    (1, 2)
(0, 1)  0.170123  0.138327  0.028239
(0, 2)  0.138327  0.170123  0.028239
(1, 2)  0.028239  0.028239  0.024656
"""

cre = ct._cov_corr_elliptical_vech(corr_r, 1, kurt=0)
dd = Dup(3)               # duplication matrix for 3 x 3 symmetric matrices
di = np.linalg.pinv(dd)
#(di.dot(cre).dot(di.T))[np.array([[1,2,4]]).T, np.array([[1,2,4]])] / covcorr[mask_low][:, mask_low]
print(cre[np.array([[1, 2, 4]]).T, np.array([[1, 2, 4]])] / covcorr[mask_low][:, mask_low])
# Note: create index arrays

# selector masks for vec, vech and veclow
mask_vech = (np.diff(mg2, 1) <= 0).squeeze()
mg2vech = mg2[mask_vech]
mask_vech2veclow = (np.diff(mg2vech, 1) < 0).squeeze()
print(cre[np.ix_(mask_vech2veclow, mask_vech2veclow)])
cr2 = di.dot(ct._cov_corr(corr_r, ct._cov_cov(corr_r, 1), 1)).dot(di.T)[np.ix_(mask_vech2veclow, mask_vech2veclow)]
# same as
# Note: we need to set nobs=1 in the call to _cov_cov.  Bug or definition?
ct._cov_corr(corr_r, ct._cov_cov(corr_r, 1), 1)[np.ix_(mask_low, mask_low)]

# variance of correlation coefficients under normality, excluding diagonal elements with var=0
((1 - vech(corr_r)**2)**2)[mask_vech2veclow]
((1 - veclow(corr_r)**2)**2)

var0, var1, covar, nobs = g1, g2, g3, 20
ztest_correlated(diff, var0, var1, covar, nobs)
# (-2.3642128782748038, 0.018068426888646225)
zt1 = -2.30
@josef-pkt (Member Author) commented on the diff:
The ztest is an application in Steiger and Hakstian. Not converted to unit tests yet, and no other support for hypothesis tests yet.

assert_equal(cr1.shape, (n_low, n_low))
assert_allclose(cr2, cr1, rtol=1e-14)
assert_allclose(cr3, cr1, rtol=1e-14)
assert_allclose(cr4, cr1, rtol=1e-14)
@josef-pkt (Member Author) commented on the diff:
There are two underlying cov_corr functions; the other variation is in the selection and the selection options to get to the minimal, unique correlations.
Missing: comparison to the Steiger-Hakstian example, a single-element function and reference numbers. Only tested for the normal case; elliptical with small kurtosis should be close. _cov_corr works for an arbitrary cov_cov, but the scaling of mom4 (1/nobs) is not checked for consistency.



def cov_cov_fisherz(corr, cov_cov):
    """covariance of Fisher's Z-transformed correlation matrix
@josef-pkt (Member Author) commented on the diff:
Not used yet; will be needed for the Fisher-transformed versions of the hypothesis tests.
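For reference, the delta-method form this would take is straightforward; a hedged sketch with a hypothetical name, using d arctanh(r)/dr = 1/(1 - r**2):

import numpy as np

def _cov_fisherz_sketch(corr_flat, cov_corr):
    """Delta method: Cov of arctanh(r) given Cov of the raveled correlations."""
    # intended for off-diagonal correlations only (diagonal r=1 would divide by zero)
    jac = np.diag(1.0 / (1.0 - np.asarray(corr_flat)**2))
    return jac @ cov_corr @ jac.T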

return m4 / (nobs - ddof)


def _mom4_normal(i, j, k, h, cov):
@josef-pkt (Member Author) commented on the diff:
Useful mainly for cross-checking; in most cases we want the Kronecker product version.
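The Kronecker-product version for the normal case is the standard identity E[xx' kron xx'] = (I + K)(Sigma kron Sigma) + vec(Sigma) vec(Sigma)', with K the commutation matrix. A hedged sketch (helper names are hypothetical, not the PR's functions):

import numpy as np

def commutation_matrix(p):
    """K_p with K_p @ vec(A) = vec(A.T) for p x p matrices."""
    idx = np.arange(p * p).reshape(p, p).ravel(order='F')
    return np.eye(p * p)[idx]

def mom4_normal_kron(sigma):
    """4th-moment matrix E[xx' kron xx'] for x ~ N(0, sigma)."""
    p = sigma.shape[0]
    vs = sigma.ravel(order='F')[:, None]          # vec(sigma) as a column
    return ((np.eye(p * p) + commutation_matrix(p)) @ np.kron(sigma, sigma)
            + vs @ vs.T)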

return ii.ravel(order='C'), jj.ravel(order='C')


def _cov_cov(cov, nobs, assume='elliptical', kurt=0, data=None):
@josef-pkt (Member Author) commented on the diff:
Replace the data argument by mom4.

@josef-pkt (Member Author) commented Nov 29, 2017

(All red: I fixed a typo in the unit tests where the wrong matrix was used, then amended and force-pushed.)

This now finishes the basic tools part; unit tests are still incomplete.

The next parts will be two applications and one additional tools refactoring or enhancement (but not now):

  • hypothesis tests, especially for linear restrictions on the correlation coefficients; two articles by Steiger, and we can now avoid his quadruple indices.
  • GLS/GMM estimation, for which we now have the cov/weight matrices (articles by Browne).
  • computational efficiency: I expect that either sparse matrix tools or replacing them by indexing/masks will be a more efficient solution. Additionally, in cases like elliptical distributions, including normal, there are special-case formulas that avoid nobs**2 by nobs**2 arrays (like mom4).

Some defaults or options, like whether to divide by nobs or not, can wait until we have applications and see what's more convenient.

@coveralls commented:
Coverage Status

Coverage decreased (-0.01%) to 81.864% when pulling ef29651 on josef-pkt:cov_matrix_tools into 37f7e05 on statsmodels:master.


@josef-pkt (Member Author) commented:

One piece that I haven't figured out yet: what is the impact of using standardized variables to compute cov_corr? Neudecker uses the cov as the starting point in the general case; _cov_corr needs the raw data or the mom4 of the raw data in the non-elliptical case. However, Steiger and Browne use the standardized mom4 in some places.

It looks like it might be possible to rewrite Neudecker and Wesselman to use only corr and a cov_cov based on the standardized mom4.

The example illustrates the scale invariance of cov_corr:

with raw data x and cov

>>> ct._cov_corr(m2, ct._cov_cov_mom4(m2, ct.mom4(x - x.mean(0), ddof=0), 1), 1)[np.ix_(mask_low, mask_low)]
array([[ 0.20742761,  0.06664495,  0.08843487],
       [ 0.06664495,  0.04776686,  0.03091085],
       [ 0.08843487,  0.03091085,  0.04860455]])

with standardized data z and corr

>>> ct._cov_corr(corr, ct._cov_cov_mom4(corr, ct.mom4(z, ddof=0), 1), 1)[np.ix_(mask_low, mask_low)]

array([[ 0.20742761,  0.06664495,  0.08843487],
       [ 0.06664495,  0.04776686,  0.03091085],
       [ 0.08843487,  0.03091085,  0.04860455]])

How to replicate cov_corr at a constrained estimate corr_r (from the test case): compute m2r, the cov under the correlation restriction; mom4 remains unrestricted.

>>> std = np.sqrt(np.diag(m2))
>>> m2r = corr_r*std[:,None]*std
>>> ct._cov_corr(m2r, ct._cov_cov_mom4(m2r, ct.mom4(x - x.mean(0), ddof=1), 1), 1)[np.ix_(mask_low, mask_low)]
array([[ 0.20200122,  0.04860593,  0.0775348 ],
       [ 0.04860593,  0.10611046,  0.04848629],
       [ 0.0775348 ,  0.04848629,  0.05116268]])

>>> i, j, k, h = 0, 1, 0, 1
>>> g1 = cov_corr_coef(i, j, k, h, corr_r, m4)
>>> g1
0.20200121658014503

That means we can (not so *) easily compute quadratic forms for goodness of fit or tests of constrained correlations, e.g. for

>>> diff = veclow(corr) - veclow(corr_r) 
>>> diff
array([-0.06665687,  0.06665687,  0.        ])

*) I don't manage it right now; see e.g. Browne 1984, eq. (2.20b), Proposition 4.

This approximately reproduces the Steiger and Hakstian example, continuing from above. The test value I got is -1.2978383834464489; SH have zt1 = -1.25.

>>> crr = ct._cov_corr(m2r, ct._cov_cov_mom4(m2r, ct.mom4(x - x.mean(0), ddof=1), 1), 1)[np.ix_(mask_low, mask_low)]
>>> cons = np.array([-1,  1,  0. ])
>>> (corr[0, 1] - corr[0, 2]) / np.sqrt(cons.dot(crr/20).dot(cons))
-1.2982304211438958
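Mechanically, the Wald-type quadratic form alluded to above looks like the following sketch (continuing with crr, corr and veclow from this session, nobs = 20; this is only the generic form, not a verified match to Browne 1984, and for a single contrast it is just the square of the z statistic above):

import numpy as np
from scipy import stats

R = np.array([[-1.0, 1.0, 0.0]])              # single contrast on veclow(corr), q = 0
resid = R.dot(veclow(corr))
vr = R.dot(crr / 20).dot(R.T)                 # nobs = 20 in this example
wald = float(resid.dot(np.linalg.solve(vr, resid)))
pvalue = stats.chi2.sf(wald, df=R.shape[0])   # df = number of restrictions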

@josef-pkt (Member Author) commented Nov 29, 2017

Just saw this (after a question about unfinished SVAR parts):

    def _compute_J(self, A_solve, B_solve):

        #first compute appropriate duplication matrix
        # taken from Magnus and Neudecker (1980),
        #"The Elimination Matrix: Some Lemmas and Applications
        # the creation of the D_n matrix follows MN (1980) directly,
        #while the rest follows Hamilton (1994)

It doesn't say what J is, but SVAR computes a cov_resid with constrained zeros under a normal distribution assumption (MLE).
I never went through those SVAR details, but it's similar to estimating a linear model with a restricted cov structure by MLE. We should get some overlapping code when the extensions here are finished.

Edit: J is only used for checking whether there are enough identifying restrictions on A and B.
The MLE is for a static problem after the lag effects have been removed by OLS (equivalent to SUR): A y_t = B u_t, where u_t are the residuals from the unrestricted VAR/OLS regression.
The returned params for the lag coefficients, var_params, are from the unrestricted VAR regression.

@josef-pkt (Member Author) commented:

Notation: Bai and Li, referenced in #4156, use veck for the lower-triangular vec without the diagonal. It's the only place where I have seen a shorthand name for it.
They also have an index construction of the vech duplication matrix.
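As a hedged illustration of what such an index construction can look like (not their code; the name is hypothetical), the vech duplication matrix can be built by mapping each vec position (i, j) to the vech position of (max(i, j), min(i, j)):

import numpy as np

def dup_matrix_from_index(k):
    """Duplication matrix D with D @ vech(A) = vec(A) for k x k symmetric A."""
    vech_pos = {(i, j): c for c, (i, j) in
                enumerate((i, j) for j in range(k) for i in range(j, k))}
    d = np.zeros((k * k, k * (k + 1) // 2))
    for j in range(k):                 # vec is column-major: row = j * k + i
        for i in range(k):
            d[j * k + i, vech_pos[(max(i, j), min(i, j))]] = 1.0
    return d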

@jbrockmendel (Contributor) commented:

Not sure if this is the right place, but a suggestion/request: add checks/assertions that covariance matrices are positive semidefinite. A failure mode I see occasionally involves this failing due to floating point error (often, but not always, accompanied by convergence failure).

@josef-pkt (Member Author) commented:

About positive definiteness:
We have some helper functions to ensure or convert to positive (semi-)definite matrices in stats.correlation_tools.

What's the context?
If it is for optimization in the models, then it is a different question. Kevin added a positive definiteness check instead of a non-invertibility/matrix_rank check for the final cov_params.
In my experience (in the old days with GAUSS), doing an eigenvalue decomposition during optimization is too expensive; lbfgs or bfgs should work better than plain Newton.
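For example, a small usage sketch of those existing helpers (the example matrix is just an illustration; defaults may differ across versions):

import numpy as np
from statsmodels.stats.correlation_tools import corr_nearest, cov_nearest

# a correlation-like matrix that is slightly indefinite
corr = np.array([[1.0, 0.95, 0.10],
                 [0.95, 1.0, 0.95],
                 [0.10, 0.95, 1.0]])
np.linalg.eigvalsh(corr)             # smallest eigenvalue is negative

corr_psd = corr_nearest(corr, threshold=1e-15)
cov_psd = cov_nearest(corr * 4.0, method='clipped')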

@josef-pkt (Member Author) commented:

It would be good to get this merged as private, experimental tools. Then we can gradually write functions on top of it, e.g. hypothesis tests for individual correlation coefficients, or element-wise hypothesis tests for covariance/correlation matrices.

Also, this is one of the main PRs that uses a kurtosis correction, i.e. extending cov/corr inference to elliptical distributions.

Labels: comp-multivariate, comp-stats, comp-tools, superseded (content superseded by a newer PR), type-enh
Projects: topic-covariance