ENH: automatically building contrast/restriction matrices #4361

Open
josef-pkt opened this issue Mar 19, 2018 · 4 comments

josef-pkt commented Mar 19, 2018

We have several open issues for this, e.g. #852, #1668, #2362.
The attempt that is currently most integrated with patsy is an extension of #2687 (comment); that extension was merged in #4365 as t_test_pairwise.

There is also preliminary code all over the place, but I never managed to figure out a general structure and API.

There are two possibilities: using patsy formula factor information, or specifying "base contrasts or encoding" explicitly.

patsy

patsy attaches the contrast matrix (cm below) used to build the encoding, see #2687. We don't currently use that information.
AFAICS:

  • the dot product of cm and params provides the base parameterization and the corresponding cov_params
  • that cov_params will be singular, and I guess we need to check the rank (?) for estimable contrasts/restrictions
  • based on the restriction value (diff) and the reparameterized cov_params (where we only need a submatrix of cov_params), we can work on selected effects/constraints (sketch after this list)
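
A minimal sketch of the first two steps, assuming res is a fitted results instance, cm is the contrast matrix that patsy attached for one factor, and cols holds the parameter indices of that factor's encoded columns (all names illustrative):

import numpy as np

# hypothetical setup: res is a fitted results instance, cm the patsy
# contrast matrix (n_levels x n_encoded_columns) for one factor, and
# cols the parameter indices of that factor's encoded columns
params_base = cm.dot(res.params[cols])                    # base parameterization
cov_base = cm.dot(res.cov_params(column=cols)).dot(cm.T)

# cov_base is singular (rank at most len(cols)), so restrictions need
# an estimability check against this rank
rank = np.linalg.matrix_rank(cov_base)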

arrays and explicit formulas

What do we need?

  • a description of the current encoding, e.g. is there a reference level, and what is the label of the reference level
  • labels, column/params indices
  • the target contrast/restriction
  • mnemonics like "all pairs", ....

current: ???

  • sandbox.regression.penalized contains three functions like coef_restriction_meandiff for structured penalization or stochastic constraints.
  • sandbox.stats.multicomp contains three functions like contrast_all_one (which is one of my early attempts).
  • sandbox.stats.contrast_tools is a more general attempt based on multicomp.
  • stats.contrast has the pre-statsmodels contrastfromcols for checking full rank, which I never figured out, AFAIR.
  • tools.tools has the pre-statsmodels isestimable, which I also never figured out.
  • special case: multivariate with MANOVA tests (2-D params, column and row hypotheses); no multiple testing, AFAIR.

related: contrasts/constructed exog for prediction, LSMEANS, margins, ... (#2167), and hypothesis tests, i.e. inference on linear restrictions of predicted means (like treatment effects)

TODO/Usage:

partially based on what I started in #2687

  • get term and factor info from design_info
  • "normalize", i.e. get a base parameterization for "means", i.e. coefficients
  • this can be a subset of cov_params, so we are independent of the other params in the model (?)
  • construct the target restriction/contrast matrix
  • t_test
  • multiple testing p-value correction, corrected confidence intervals (not available yet, e.g. SUMM/FAQ-D: simultaneous confidence intervals #2955)
  • reporting, summary and plots

update: in #4365 I combined step 2 ("normalize") and step 4 (target restriction matrix) into a single combined restriction matrix that applies to the actual parameterization of the model and can be used directly with t_test and similar (sketch below).
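
A sketch of that combined matrix, under the same illustrative names as above, with R_pairs a target all-pairs contrast in the base parameterization:

import numpy as np

# hypothetical names: R_pairs (target contrast in base parameterization),
# cm (patsy contrast matrix), cols (parameter indices of the factor)
R_full = np.zeros((R_pairs.shape[0], len(res.params)))
R_full[:, cols] = R_pairs.dot(cm)   # "normalize" and target step in one matrix
tt = res.t_test(R_full)             # applies directly to the model's params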

implementation detail: Where do we put these things?

Aside:
issue ??? suggests using prediction instead of coefficients. This would work for the linear prediction (e.g. in GLM), but nonlinear prediction of the response would have a different effect. Predicted values are likely less close to normally distributed than coefficients, and predicted values depend on the values of the other exog, as in margins or predict_at.
Prediction and LSMEANS run into the same problem as here, in that we need to construct the auxiliary exog (i.e. an artificial sample reflecting the factor levels).
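
One way to construct that auxiliary exog is to rebuild the design matrix from the stored design_info on a grid of factor levels; a sketch with hypothetical variable names, assuming the model was created from a formula so that design_info is available:

import pandas as pd
from patsy import build_design_matrices

# hypothetical: 'DETERGENT' is the factor, 'x' another exog held fixed
grid = pd.DataFrame({'DETERGENT': ['A', 'B', 'C', 'D'],
                     'x': [1.5] * 4})
(exog_aux,) = build_design_matrices([res.model.data.design_info], grid)
pred = res.model.predict(res.params, exog_aux)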

@josef-pkt

annoying implementation detail:

We need a sign convention.

When we have an ordered set of levels, e.g. A, B, C, do we compute A - B or B - A in the pair comparison?
This is currently inconsistent across older versions in sandbox.
The labels assigned to the pairs should be consistent with a literal interpretation:
if the label is A-B but the values are B - A, then the dash cannot be interpreted as a minus sign (see the sketch below).
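
For illustration, a minimal sketch that keeps labels and signs consistent, so that a label "A-B" literally means A - B (level names here are made up):

# build all-pairs contrast rows with labels matching the signs
labels = ['A', 'B', 'C']
k = len(labels)
rows, names = [], []
for i in range(k):
    for j in range(i + 1, k):
        row = [0] * k
        row[i], row[j] = 1, -1            # value is level_i - level_j
        rows.append(row)
        names.append(labels[i] + '-' + labels[j])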

res0 = get_pairwise(model, 'C(DETERGENT)', method=['hommel'], ignore=False, factor_labels='A B C D'.split())
print(res0.result_frame)
 
         coef  std err         t     P>|t|  Conf. Int. Low  Conf. Int. Upp.  pvalue-hommel reject-hommel
A-B  2.000000  1.44658  1.382572  0.216055       -1.539653         5.539653       0.216055         False
A-C  4.666667  1.44658  3.226001  0.018001        1.127014         8.206319       0.072003         False
A-D -3.666667  1.44658 -2.534715  0.044396       -7.206319        -0.127014       0.133189         False
B-C -2.666667  1.44658 -1.843429  0.114831       -6.206319         0.872986       0.216055         False
B-D  5.666667  1.44658  3.917286  0.007826        2.127014         9.206319       0.039132          True
C-D  8.333333  1.44658  5.760715  0.001193        4.793681        11.872986       0.007157          True

TukeyHSD displays two columns, group1 and group2, and AFAICS (in what I'm doing now) does not create combined labels.
aside: AFAICS TukeyHSD does not create a results DataFrame and has no summary_frame; summary() returns a SimpleTable.

josef-pkt commented Mar 19, 2018

storing the computation for using the design_info contrast matrix to get base coefficients.

this works manually, but we don't have a standalone t_test function to do this automatically (maybe #3570 or others that I cannot find).
I added a FakeModel in #3570 (comment) that works for this example.

# try to work with the patsy design_info contrast_matrix
import numpy as np
from scipy import stats
import statsmodels.sandbox.stats.multicomp as mc

res = model   # a fitted results instance

# params and cov_params restricted to the columns of the factor of interest
cols = np.arange(3, 6)
cov1 = res._results.cov_params(column=cols)
p1 = res._results.params[cols]

# reparameterize to the base parameterization using the contrast matrix
# cmm that patsy attached to design_info (see #2687)
p1f = cmm.dot(p1)
cov1f = cmm.dot(cov1).dot(cmm.T)

# all pairwise comparisons of the 4 levels; the sign flip matches the
# sign convention discussed above
c4 = -mc.contrast_allpairs(4)

eff = c4.dot(p1f)
cov4f = c4.dot(cov1f).dot(c4.T)
bse = np.sqrt(np.diag(cov4f))

# two-sided p-values; these agree with tt, a reference t_test result
pvalues = stats.t.sf(np.abs(eff) / bse, res.df_resid) * 2

eff, tt.effect             # effects match
pvalues, tt.pvalue         # p-values match
res.df_resid, tt.df_denom  # degrees of freedom match

@josef-pkt

parking two list comprehensions; I don't know if they are useful somewhere

analogue of triu: all pairwise comparisons

>>> cat = ['a', 'b', 'c', 'ddd', 'e']   # works for arbitrary labels
>>> cat = list(range(5))                # integers for a compact example
>>> for x in [(i, j) for idx, i in enumerate(cat) for j in cat[idx + 1:]]: print(x)
... 
(0, 1)
(0, 2)
(0, 3)
(0, 4)
(1, 2)
(1, 3)
(1, 4)
(2, 3)
(2, 4)
(3, 4)

simple product of two lists

>>> [[i, j] for i in [0, 1] for j in ['a', 'b', 'c']]
[[0, 'a'], [0, 'b'], [0, 'c'], [1, 'a'], [1, 'b'], [1, 'c']]
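
The standard library already covers both patterns, which may be preferable to hand-rolled comprehensions:

>>> from itertools import combinations, product
>>> list(combinations(range(5), 2))
[(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
>>> list(product([0, 1], ['a', 'b', 'c']))
[(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')]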

@josef-pkt

another case

testing: all levels are equal to the mean
Even if we have k independent samples, the variance of each contrast is a linear combination of all the variances. However, the model Wald tests like t_test automatically compute the cov of an array of linear combinations. We can reuse this for some contrasts even if the intention is standalone k-sample tests.

The contrast/restriction should be something like np.eye(k) - np.ones((k, k)) / k
(patsy has names for various encodings that correspond to different contrasts; maybe it is Helmert)

If we want to reuse the models for standalone hypothesis tests, then we control the structure of exog and we don't have to reverse engineer it from patsy's design_info.

However, we can also just use basic linalg to get the cov directly, assuming we have cov_params = diag(var_means). Then cov_restriction is just C @ cov_params @ C' (minimal sketch below).
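
A minimal sketch of that linalg route, with hypothetical means and var_means for k independent samples:

import numpy as np

# hypothetical inputs: estimated level means and their variances
means = np.array([2.0, 4.0, 3.0, 5.0])
var_means = np.array([0.5, 0.4, 0.5, 0.6])
k = len(means)

# contrast: each level mean minus the grand mean
C = np.eye(k) - np.ones((k, k)) / k
cov_params = np.diag(var_means)          # independent samples
effect = C @ means
cov_restriction = C @ cov_params @ C.T   # singular, rank k - 1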

Related: can we handle a trend test for ordered categorical exog, or an equality null with a trend alternative, using the models/results? AFAIR, we are still missing standard (standalone) trend tests.
For independent samples, this would have a specific structure of one-sided hypotheses (see the sketch below).
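
As a sketch of what such a trend contrast could look like, with equally spaced (hypothetical) scores for k ordered levels:

import numpy as np

# single-row linear trend contrast for k ordered levels; the scores are
# illustrative and would come from the ordering of the categories
k = 4
scores = np.arange(k, dtype=float)
c_trend = (scores - scores.mean())[None, :]   # 1 x k restriction matrix
# a one-sided test would use
# t = (c_trend @ means) / np.sqrt(c_trend @ cov_params @ c_trend.T)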
