ENH: enhance _MultivariateOLS, MANOVA, code duplication, #8722

Open
josef-pkt opened this issue Mar 7, 2023 · 5 comments

Comments

@josef-pkt
Member

I thought MANOVA is using _MultivariateOLS.
However, it looks like they share code, helper functions, but manova doesn't reuse the _MultivariateOLS class.
There is also quite a bit of code duplication.

I was looking for access to the _MultivariateOLS instance in the MANOVA and its test result instances, but it's not available.

_MultivariateOLS does not have a summary implemented, which makes it difficult to get a quick overview of results.

context #8713 trying to figure out usage and problems with multi-way manova.

based on an example: _MultivariateOLS runs an identical test to MANOVA

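from statsmodels.multivariate.multivariate_ols import _MultivariateOLS

# p_df: DataFrame with columns PC1..PC4, Genotype, Temp, Time
formula = 'PC1 + PC2 + PC3 + PC4 ~ C(Genotype, Helmert) * C(Temp, Helmert) * C(Time, Helmert)'
mod = _MultivariateOLS.from_formula(formula, data=p_df)
res = mod.fit()
tt = res.mv_test()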

but res does not have any of the usual results attributes and methods, not even params

[i for i in dir(res) if not i.startswith("__")]
['_fittedmod',
 'design_info',
 'endog_names',
 'exog_names',
 'mv_test',
 'summary']
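
A minimal sketch of the comparison above on synthetic data instead of `p_df`. The column names `y1`, `y2`, `g` and the random data are assumptions for illustration, not part of the original example. It fits the same formula with `_MultivariateOLS` and `MANOVA` and checks that the default `mv_test` tables agree term by term.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
from statsmodels.multivariate.multivariate_ols import _MultivariateOLS

# Synthetic balanced one-way layout; names and values are illustrative only.
rng = np.random.default_rng(0)
nobs = 60
df = pd.DataFrame({
    "g": np.repeat(["a", "b", "c"], nobs // 3),
    "y1": rng.normal(size=nobs),
    "y2": rng.normal(size=nobs),
})
formula = "y1 + y2 ~ C(g)"

tt_ols = _MultivariateOLS.from_formula(formula, data=df).fit().mv_test()
tt_manova = MANOVA.from_formula(formula, data=df).mv_test()

# Compare the statistics table of every term that MANOVA tests by default.
same = {}
for term in tt_manova.results:
    stat_ols = tt_ols.results[term]["stat"].to_numpy(dtype=float)
    stat_manova = tt_manova.results[term]["stat"].to_numpy(dtype=float)
    same[term] = bool(np.allclose(stat_ols, stat_manova, equal_nan=True))

print(same)
```
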
@josef-pkt
Member Author

I'm trying to figure out more generally what we need for Multivariate linear model.

Do the inferential results differ from OLS with cluster robust standard errors?
The params will be the same, and I guess cov_params will be the same or similar (except for df, small sample corrections)
What are the "rank" conditions between MultivariateOLS and OLS with cluster robust standard errors?

MultivariateOLS might be a misnomer if we add GLS inference, i.e. only params and within-equation inference are equivalent to OLS.
"MultivariateLinearModel" which might mean a likelihoodmodel, gaussian or quasi-gaussian
maybe "MultivariateLS"

Do we need a "blown up", memory inefficient version as reference, using kronecker product exog?
It would not be a memory problem in small samples as in experimental data.
But, I think we get into the SUR case if we allow for restrictions or penalization (#7255) of individual parameters.
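
A rough numpy-only sketch of what such a "blown up" reference version could look like. The sizes and random data are made up for illustration. It checks that least squares on the stacked endog with a Kronecker product exog reproduces the column-by-column params of the compact form.

```python
import numpy as np

rng = np.random.default_rng(0)
nobs, k_vars, k_endog = 30, 3, 2  # illustrative sizes
X = np.column_stack([np.ones(nobs), rng.normal(size=(nobs, k_vars - 1))])
Y = rng.normal(size=(nobs, k_endog))

# Compact form: one solve with a 2-dim endog, params has shape (k_vars, k_endog).
params_wide = np.linalg.lstsq(X, Y, rcond=None)[0]

# Long form: stack the columns of Y and use the block-diagonal design I_k ⊗ X.
y_long = Y.ravel(order="F")
X_long = np.kron(np.eye(k_endog), X)
params_long = np.linalg.lstsq(X_long, y_long, rcond=None)[0]

# params_long is ordered equation by equation, i.e. column-major ravel of params_wide.
max_diff = np.max(np.abs(params_long - params_wide.ravel(order="F")))
print(X_long.shape, max_diff)
```

The long-form design has `nobs * k_endog` rows and `k_vars * k_endog` columns, so the memory cost grows with the square of the number of equations.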

What about GMM equivalent model?
Would not be too difficult with horizontal stacking of moment conditions, and robust cov_types would be inherited.

Note: this is all for balanced groups/panel case, i.e. same number of obs for each equation.

aside:
nice proof that the within-equation cov_params is identical between single equation OLS and GLS
https://economics.stackexchange.com/questions/45753/seemingly-unrelated-regression-estimation-equivalent-to-ols-standard-errors
However, it does not look at cross-equation cov, cov(beta_i, beta_j) for i != j
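
For the cross-equation part, the textbook result for identical regressors is cov(vec(B)) = Sigma ⊗ (X'X)^{-1}. The sketch below uses made-up data and assumes `df_resid = nobs - k_vars` for the estimate of Sigma. It shows that the diagonal blocks match single equation OLS, while the off-diagonal blocks give cov(beta_i, beta_j).

```python
import numpy as np

rng = np.random.default_rng(1)
nobs, k_vars, k_endog = 40, 3, 2  # illustrative sizes
X = np.column_stack([np.ones(nobs), rng.normal(size=(nobs, k_vars - 1))])
# Errors correlated across the two equations.
errors = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 2.0]], size=nobs)
Y = X @ rng.normal(size=(k_vars, k_endog)) + errors

params = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ params
df_resid = nobs - k_vars  # assumption: single-equation degrees of freedom
sigma_hat = resid.T @ resid / df_resid
xtx_inv = np.linalg.inv(X.T @ X)

# Covariance of params raveled equation by equation (column-major).
cov_full = np.kron(sigma_hat, xtx_inv)

# Diagonal block 0 equals the single equation OLS covariance s_0^2 (X'X)^{-1}.
cov_eq0 = cov_full[:k_vars, :k_vars]
cov_ols0 = (resid[:, 0] @ resid[:, 0] / df_resid) * xtx_inv

# Off-diagonal block: cov(beta_0, beta_1) = sigma_01 (X'X)^{-1}.
cov_cross = cov_full[:k_vars, k_vars:]
print(np.allclose(cov_eq0, cov_ols0))
```
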

@josef-pkt
Member Author

josef-pkt commented Mar 10, 2023

Inference in Multivariate linear model "MultivariateGLS"?
same regressors for each endog.

this article looks useful, includes the eigenvalue based tests Rao, ...
and standard Wald on raveled params
for row-column hypothesis as in MANOVA

Stewart, Kenneth G. “Exact Testing in Multivariate Regression.” Econometric Reviews 16, no. 3 (January 1, 1997): 321–52. https://doi.org/10.1080/07474939708800390.

a quick try comparing t_test with mv_test
mvGLS t_test with mv_test: test statistic t**2 and F are very close, however not identical.
mvGLS t_test with single equation OLS t_test: test statistics, tvalues are identical but p-values are only close if use_t=True.

problem is how to define consistent df_resid

for _MultivariateOLS, I used res.df_resid = nobs * k_groups - res.params.size (corresponding to long form of OLS/GLS)
single equation uses nobs - k_vars
aside: k_groups - k_params is negative in the example, k_params = k_groups * k_vars

stats analogue would be 2 or k paired, correlated samples. What's the df for t-test?
It should be nobs - 1 if we just t-test the observationwise (pair) diff

update
df_resid = nobs - k_vars looks better
justification would be that params are equivalent to single equation regression
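
A small sketch of how much the choice of df_resid matters for a t-test p-value. The sizes and the t statistic are hypothetical, and `k_groups` is read here as the number of endog columns (`k_endog`), which is an assumption.

```python
from scipy import stats

nobs, k_vars, k_endog = 20, 4, 3  # hypothetical sizes
tvalue = 2.1                       # hypothetical t statistic

# Long-form count: all stacked observations minus all params.
df_long = nobs * k_endog - k_vars * k_endog
# Single-equation count.
df_single = nobs - k_vars

pval_long = 2 * stats.t.sf(abs(tvalue), df_long)
pval_single = 2 * stats.t.sf(abs(tvalue), df_single)
pval_normal = 2 * stats.norm.sf(abs(tvalue))  # use_t=False analogue

print(df_long, df_single, pval_long, pval_single, pval_normal)
```

The tvalue itself is unchanged; only the reference distribution moves, so the gap between the p-values shrinks as nobs grows.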

The mv_test have df_denom (df_resid) that are neither of the two above.
In small sample Roy's greatest root test differs quite a bit from the other three, both in p-values and df_num, df_denom (for multi-parameter joint hypothesis)

aside: multivariate L B M hypothesis only allows for within equation hypotheses, AFAICS, but joint over all or several equations.
Correction: that is wrong, M can do multi-equation comparisons. The only restriction is that hypotheses are on a rectangular block of params.

@josef-pkt
Member Author

josef-pkt commented Mar 10, 2023

aside: Roy's greatest root

df is not the same as in Stewart 1997, it uses the max(p, q)

quote
"where r=max(p, q) is an upper bound on F that yields a lower bound on the significance level. Degrees of freedom are r for the numerator and v - r + q for the denominator. "
where "Let v be the error degrees of freedom"

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_introreg_sect038.htm#statug_introreg002005

    sigma = results.loc["Roy's greatest root", 'Value']
    r = np.max([p, q])
    df1 = r
    df2 = v - r + q
    F = df2 / df1 * sigma
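
The snippet above depends on `results`, `p`, `q` and `v` from the session. A self-contained sketch of the same computation follows. The function name is hypothetical, and the p-value from `scipy.stats.f.sf` is added here on the assumption that the F upper bound is compared to an F(df1, df2) distribution as in the quoted SAS text.

```python
from scipy import stats


def roy_f_upper_bound(theta, p, q, v):
    """F upper bound for Roy's greatest root (SAS-style df).

    theta : largest root, the 'Value' entry for Roy's greatest root
    p, q  : dimensions as in the quoted SAS documentation
    v     : error degrees of freedom
    """
    r = max(p, q)
    df_num = r
    df_denom = v - r + q
    f_value = theta * df_denom / df_num
    # Lower bound on the significance level, since f_value is an upper bound.
    p_value = stats.f.sf(f_value, df_num, df_denom)
    return f_value, df_num, df_denom, p_value


# Hypothetical inputs, only to show the call.
f_value, df_num, df_denom, p_value = roy_f_upper_bound(0.5, p=2, q=3, v=20)
print(f_value, df_num, df_denom, p_value)
```
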

@josef-pkt
Member Author

aside:
I should add the analogue to wald_test_terms to MANOVA, MultivariateGLS
specifically all terms that involve a factor are zero under null
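
One possible way to build the contrast for "all terms that involve a factor", as a sketch only. `contrast_for_factor` is a hypothetical helper, and the `term_slices` dict is hand-written to mimic the layout of patsy's `design_info.term_name_slices`.

```python
import numpy as np


def contrast_for_factor(term_slices, factor, k_exog):
    """Rows of the identity selecting every column of terms containing `factor`."""
    rows = []
    for term, sl in term_slices.items():
        # Interaction terms are named "A:B"; split to recover the factors.
        if factor in term.split(":"):
            rows.extend(range(sl.start, sl.stop))
    return np.eye(k_exog)[rows, :]


# Hand-written example layout: intercept, two main effects, one interaction.
term_slices = {
    "Intercept": slice(0, 1),
    "C(A)": slice(1, 3),
    "C(B)": slice(3, 4),
    "C(A):C(B)": slice(4, 6),
}
L_A = contrast_for_factor(term_slices, "C(A)", k_exog=6)
print(L_A.shape)
```

The resulting matrix could then be passed as the contrast L of a hypothesis in `mv_test`, with M left as None to test all endog jointly.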

current MANOVA is type 3, i.e. main factor is tested in the model that also includes interaction terms

http://users.stat.umn.edu/~helwig/notes/aov2-Notes.pdf for univariate anova
p. 57 type 2 anova tests main effect in the model without interaction effect (section for unbalanced anova)
this is different from testing that both main and interaction effects are zero in full model.

@josef-pkt
Member Author

back to the roots

Berndt, Ernst R., and N. Eugene Savin. “Conflict among Criteria for Testing Hypotheses in the Multivariate Linear Regression Model.” Econometrica 45, no. 5 (1977): 1263–77. https://doi.org/10.2307/1914072.

One application for multivariate models is cost and consumption share estimation.
This should get us closer to one of the original demands for multivariate regression in compositional analysis #3560
(related MNLogit does not handle fractional data, AFAIK)
