
ENH: mixed effects / repeated measures #646

Open
dengemann opened this issue Feb 5, 2013 · 71 comments
@dengemann (Contributor)

How is this evolving at the moment? Looking at the sandbox there does not seem to be much change.
I've got a couple of use cases for which I ideally need to report 'repeated measures' analyses.
Also, for quite some time I've been thinking about implementing a general solution applicable to cognitive neuroscience / physiological data.
So typically it should handle scenarios like: Y ~ within_a x within_b x between_j. Y can be button press latencies, error rates, etc., or brain responses. In that case Y has to be estimated at location l, time t (and frequency f).
I might give this a shot if no one is deeply devoted to this during the next weeks or so.

@josef-pkt (Member)

Nothing going on at the moment, I haven't worked on this since #145.

When I worked on this, I didn't have a specific use case in mind, so some of the code is all over the place, preparing for different models in repeated measures, mixed effects, panel data and GLS with estimated error structure.

Much of the basics/infrastructure for repeated measures in the ANOVA style is missing; however, there is quite a lot of code to estimate with a non-identity error structure, going in the direction of GEE.

It wasn't on my schedule to work on this, and nobody else is working on it, so give it a shot.
I can help, especially if you need to find your way around the existing code.

Questions that will affect how to efficiently implement estimation for specific use cases relate to the structure of your data.
Do you have many observations per individual and few individuals (long panel), or few observations per individual and many individuals (wide panel)? Balanced or unbalanced (same number of observations per individual or not)?

For the balanced case, there are also multivariate models in the sysreg pull request (GSoC).

This area is a bit large, but we don't have to worry about all the different versions.
I think it would be good to get one use case out of the sandbox. Also, some of it may not be relevant for your initial repeated measures case, but might come in handy for later extensions.

Have fun and code :)

@dengemann (Contributor, author)

Yay,

> It wasn't on my schedule to work on this, and nobody else is working on it, so give it a shot.
> I can help, especially if you need to find your way around the existing code.

Great, I'll get back to you in case I start getting lost in the sandbox jungle ;-)

> Questions that will affect how to efficiently implement estimation for specific use cases relate to the structure of your data.

Exactly, this is something I worry about.

> Do you have many observations per individual and few individuals (long panel), or few observations per individual and many individuals (wide panel)? Balanced or unbalanced?

Typically my use case is of sort a): 20-30 subjects with 200-400 total observations. But then it depends at which level I analyze the data.
For behavioral studies, to avoid balancing issues, to keep things simple, and to skip the tedious sum-of-squares types I-III debate, I would aggregate responses. If the experiment allows for it I would, e.g., take the median of one block (which normally gets replicated a few times), so I get 3-4 observations per within-subject factor cell (typically 2 x 2, so about 16 observations per individual).
It is actually quite appealing to me to model single trials. The strategy pictured above is just quite adaptive given the publication culture in certain fields of interest. Personally I'd like to move on, though. For this I would be happy to have a procedure at hand (and in Python) that allows me to do robust, simple-minded repeated measures as well as more sophisticated hierarchical regression models.

The second use case is the more critical one, and it is even more clearly related to the long-panel style (although the data aren't represented as a panel here). Here we would have, e.g., an ndarray of subjects x conditions x time slices x cortical vertices (x frequency). The implied data matrix can easily eat up some GB.
For this case I'm not exactly sure how much 'repeatedness' and ANOVA I really need. Most often, first-level estimates would be generated using variants of a mass-univariate approach to produce a spatio-temporal statistical map, which is then passed to a non-parametric test, often a clustering permutation test.
It would be nice, however, to be able to specify a model allowing me to do inferences on the group level, not only on the individual level, and also to test for interaction contrasts + main effects in one model. Maybe the available GLM framework already provides me with most of what I need. I've got to think about it some more.

Anyway, having a model class that is able to reproduce 2x2 repeated-measures models from R's aov in Python and is implemented efficiently would be very neat and useful. I think this might also serve as another access point to statsmodels.

> For the balanced case, there are also multivariate models in the sysreg pull request (GSoC).

Sounds interesting.

> This area is a bit large, but we don't have to worry about all the different versions. I think it would be good to get one use case out of the sandbox.

Yes, definitely; as I indicated, I would like something that, given the right parameters, does repeated measures but also can do more. Let's see.

> Have fun and code :)
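For concreteness, the mass-univariate group-level part I mean could look roughly like this (a sketch on toy data; shapes and variable names are made up, not an existing API):

```python
import numpy as np
from scipy import stats

# toy data: subjects x conditions x time slices x vertices
n_subj, n_cond, n_times, n_verts = 20, 2, 50, 100
rng = np.random.RandomState(0)
Y = rng.randn(n_subj, n_cond, n_times, n_verts)

# first level: within-subject condition contrast at every (time, vertex)
contrast = Y[:, 1] - Y[:, 0]          # shape (n_subj, n_times, n_verts)

# group level, mass-univariate: one-sample t-test across subjects
t_map, p_map = stats.ttest_1samp(contrast, popmean=0, axis=0)
# t_map / p_map are the spatio-temporal statistical maps that would then
# be passed to a (cluster) permutation test
```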

@josef-pkt (Member)

sounds interesting

just two comments:

large parts of the estimation in #145 were oriented towards short and wide panels (including unbalanced for some parts). We will see how much can also be applied to long panels.

a general idea for very large datasets:
currently models and statistical tests want to have the full data. In many cases we could also base them on summary statistics, which could be more easily precalculated with large out-of-memory data. It's an old idea, but we haven't done anything in this direction since I haven't heard of anyone using datasets that large. Possibly useful later on.
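To make the summary-statistics idea concrete, a minimal sketch (not existing statsmodels API): for OLS it is enough to accumulate X'X and X'y over chunks, so the full data never has to be in memory at once.

```python
import numpy as np

def ols_from_chunks(chunks):
    """OLS coefficients from accumulated X'X and X'y.

    chunks yields (X, y) pairs, e.g. read block-wise from disk.
    """
    XtX, Xty = None, None
    for X, y in chunks:
        if XtX is None:
            k = X.shape[1]
            XtX, Xty = np.zeros((k, k)), np.zeros(k)
        XtX += X.T @ X
        Xty += X.T @ y
    return np.linalg.solve(XtX, Xty)
```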

@dengemann (Contributor, author)

> sounds interesting

... not sure which expansion / unpacking algorithm to apply ;-)

> large parts of the estimation in #145 were oriented towards short and wide panels (including unbalanced for some parts). We will see how much can also be applied to long panels.

Yay, I think it will boil down to learning about the statsmodels style and already implemented estimation functions and then writing / factoring it from scratch. Another idea that came to my mind was to extend the GLM / somehow fit it into the GLM framework, maybe via subclassing / arguments / formula, such that it becomes possible to specify an error term in the R aov style.

> a general idea for very large datasets: currently models and statistical tests want to have the full data. In many cases we could also base them on summary statistics, which could be more easily precalculated with large out-of-memory data.

I think I've got to slow down here. In most cases this stuff easily runs on my 16 GB RAM / 12-core machine. But more content-related: actually I would need this stuff to generate first-order / first-level etc. estimates. Summary stats would kick in when it comes to estimating population effects / modeling subject-specific error terms.

@josef-pkt (Member)

I think it would be helpful if you start a SMEP on the Wiki. If you have specific R functions and options (like aov) and examples, then that would be useful so we can get a more specific picture of where you want to go. I still have only a vague idea of what's required, and that mostly on the estimation parts.

Given that we have formulas with patsy now, and Skipper's ANOVA example, it might be good to start from there and then see which parts or models are missing.

It could be useful to have an overview before getting lost in the "jungle".

Specifically, do you need one of the GLM distribution families, or is it mainly linear models?

@dengemann (Contributor, author)

> I think it would be helpful if you start a SMEP on the Wiki. If you have specific R functions and options (like aov) and examples, then that would be useful so we can get a more specific picture of where you want to go. I still have only a vague idea of what's required.

Yes, makes a lot of sense. Me too, btw., I feel I'm stepping on 'terrain vague'.

> Given that we have formulas with patsy now, and Skipper's ANOVA example, it might be good to start from there and then see which parts or models are missing.

Yes, exactly.

> It could be useful to have an overview before getting lost in the "jungle".

Yes, first starting with something more constrained might be a good idea (use case 1: replace R repeated measures ANOVA for behavioral data, with the R aov + patsy example as a starting point). I might browse through the R data to find an experimental-psychology test dataset exemplifying this common use case.

> Specifically, do you need one of the GLM distribution families, or is it mainly linear models?

For the beginning I would concentrate on gaussian / linear / identity-link. Sometimes I've got to deal with log models, but that's very special / maybe for later.

@josef-pkt (Member)

just remembered: UCLA and some other university websites often have good examples.
A quick search found this: http://www.ats.ucla.edu/stat/r/seminars/Repeated_Measures/repeated_measures.htm
Is this along the lines of your use case?

@dengemann (Contributor, author)

That's one of my favorites; I would often give it to students as an example of how to do it in R.

@dengemann (Contributor, author)

this is another one which is quite instructive for use case 1 (behavioral data):

http://personality-project.org/r/r.anova.html

@dengemann (Contributor, author)

... and finally:
http://www.psych.upenn.edu/~baron/rpsych/rpsych.html#htoc52

sorry for the incremental updates.

@josef-pkt (Member)

Examples or descriptions like these were also among the references that I used for the short panel case. For example, I got a start on correlation structures to use with short panels here: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/panel/correlation_structures.py

@dengemann (Contributor, author)

I think I found one.
lme4::sleepstudy should be a feasible example.

@dengemann (Contributor, author)

Unfortunately it misses real experimental conditions; let's see what else can be found.

@josef-pkt (Member)

a partial map to the Jungle:

The two main current classes for this are OneWayMixed and ShortPanelGLS.

Some example scripts are here:
"statsmodels\sandbox\examples\ex_random_panel.py"
"statsmodels\sandbox\examples\ex_mixed_lls_0.py"
"statsmodels\sandbox\examples\ex_mixed_lls_re.py"
"statsmodels\sandbox\examples\ex_mixed_lls_timecorr.py"

Both models assume we have a block correlation structure by individual (within), but not across individuals.

OneWayMixed currently has a refactoring bug, uses the EM algorithm, and is mainly a cleaned-up version of the original model from when stats.models was in scipy.
ShortPanelGLS is my version and uses iterative generalized least squares.

In both cases, we need enough individuals to estimate the correlation parameters across individuals, or we need to assume a simple correlation structure so we don't need to estimate many parameters for it.

OneWayMixed also allows for random effects, but I haven't found a direct R analog to write tests against.

@dengemann (Contributor, author)

Thanks heaps!

> Both models assume we have a block correlation structure by individual (within), but not across individuals.

which is something one could feed into a summary stats approach...

> In both cases, we need enough individuals to estimate the correlation parameters across individuals,

what does enough mean here? n >= 20?

> or we need to assume a simple correlation structure so we don't need to estimate many parameters for it.

which is not a big deal for simple cases with 2 factor levels...

> OneWayMixed also allows for random effects, but I haven't found a direct R analog to write tests against.

there must be one ...

So what's missing here is basically general support for k-way interaction terms?

@vincentarelbundock (Contributor)

FYI, I just added a bunch of new data from the psych and HSAUR packages to the Rdatasets archive. For example, this ANOVA tutorial uses weightgain, foster, water, skulls from the HSAUR package:

http://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_analysis_of_variance.pdf

CSV datasets listed here:

http://vincentarelbundock.github.com/Rdatasets/
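These can be pulled straight into Python, roughly like this (a sketch; it assumes the HSAUR datasets named above are reachable through the Rdatasets archive):

```python
import statsmodels.api as sm

# e.g. the HSAUR "weightgain" data used in the ANOVA vignette above
weightgain = sm.datasets.get_rdataset("weightgain", "HSAUR").data
print(weightgain.head())
```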

@dengemann (Contributor, author)

Yay, this is neat!
Thanks for pointing it out, I'm sure that will be very helpful.

@josef-pkt (Member)

> In both cases, we need enough individuals to estimate the correlation parameters across individuals,
> what does enough mean here? n >= 20?

That depends on the case. For an unstructured (full within-individual) correlation matrix the number of parameters is T*(T-1)/2 (or close to it), where T is the number of observations per individual; e.g. with T = 16 that is already 120 correlation parameters. If T is large, the number of individuals has to be much larger.
For AR(1) and other structured cases we only have one parameter to estimate, and estimating AR(1) also uses the intertemporal information, so we need fewer individuals.

> OneWayMixed also allows for random effects, but I haven't found a direct R analog to write tests against.
> there must be one ...

IIRC, there are different algorithms used, close enough to get similar results in some cases, but not close enough to catch smaller bugs (degrees of freedom or similar) or to test all cases.

> So what's missing here is basically general support for k-way interaction terms?

Interaction effects can be specified as fixed effects; the assumption is on the error structure (random effects). For example, if there is one correlation in the error that is individual specific and one that is time specific, then that case is not handled by the current structure. (This is a common use case in economics for panel data.)
So, multi-way random effects across and within individuals are not covered; random effects within individuals can be pretty much anything.

OneWayMixed allows for explanatory variables in the random effects part. It's very flexible, but I haven't figured out all the ways that this could be used.

@dengemann (Contributor, author)

> That depends on the case. For an unstructured (full within-individual) correlation matrix the number of parameters is T*(T-1)/2 ... For AR(1) and other structured cases we only have one parameter to estimate, so we need fewer individuals.

I see.

> IIRC, there are different algorithms used, close enough to get similar results in some cases, but not close enough to catch smaller bugs (degrees of freedom or similar) or to test all cases.

mhm.

> Interaction effects can be specified as fixed effects; the assumption is on the error structure (random effects).

yes, got it.

> For example, if there is one correlation in the error that is individual specific and one that is time specific, then that case is not handled by the current structure. (This is a common use case in economics for panel data.)

Ok, we're getting closer. In most cases my error term includes the correlation due to the repeated responses from the individual; however, the conditions from the randomized block design (2 factors, 2-3 levels each, fully balanced) are taken into account. I most often would use models expressed by a formula like this:

Y ~ (A * B) + Error(Subject / (A * B))

> OneWayMixed allows for explanatory variables in the random effects part.

That sounds interesting.
I think I've now got to play with the materials we collected during this conversation and come back with a first, more concrete proposal.

@josef-pkt (Member)

One more comment before you go to play

Y ~ (A * B) + Error(Subject / (A * B)) sounds like what OneWayMixed might be doing
http://stats.stackexchange.com/questions/13784/repeated-measures-anova-with-lme-in-r-for-two-within-subject-factors

I didn't or don't understand enough R to come up with examples like these, so maybe I missed the comparison case in R because it's all hidden inside the formulas.

@dengemann (Contributor, author)

> I didn't or don't understand enough R to come up with examples like these, so maybe I missed the comparison case in R because it's all hidden inside the formulas.

That was my suspicion and the reason I posted this line. We might want to investigate this using a test candidate. Maybe it will pass if supplied with the appropriate case ...
Terminology is a permanent source of confusion, but so are the formulas. Admittedly, I'm having a hard time understanding how the formulas are expanded and how they translate to matrices when reading the related R source code, e.g. of aov.

@dengemann (Contributor, author)

So how would this kind of model look using OneWayMixed?
Could I feed it with the same formula using patsy?
If it turns out to work, that would generate material for the example section.

@josef-pkt (Member)

The code is still in the sandbox for a reason :)

There is no nice user interface yet, and it predates the addition of formulas.

If you have categorical variables, then you could use patsy to create the design matrices.
The interface currently still requires defining Units, collecting the data and calculations for each individual, as in this example: "statsmodels\sandbox\examples\ex_mixed_lls_re.py"

Get y, X, Z for each individual and create a list of Units:
y is endog,
X is exog for the fixed effects and other regressors, and should include Z,
Z is exog for the random effects.

Pseudo code:

    units = []
    for each individual:
        # build Y (endog), X (fixed-effects exog, including Z), Z (random-effects exog)
        unit = Unit(Y, X, Z)
        units.append(unit)

    # then call:
    mod = OneWayMixed(units)
    mod.fit()

After fit the results should be available, but I don't remember how far I got with creating a results instance instead of attaching the estimates to the model instance.
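A rough sketch of how patsy plus pandas could produce the per-individual pieces (column names are made up, and the tuples below only stand in for the Unit objects; the actual Unit constructor in the sandbox may differ):

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

df = pd.read_csv("repeated_measures_long.csv")   # columns: rt, A, B, subject

unit_data = []
for subj, d in df.groupby("subject"):
    y, X = dmatrices("rt ~ C(A) * C(B)", d, return_type="dataframe")
    Z = X[["Intercept"]]                          # random intercept only
    unit_data.append((np.asarray(y), np.asarray(X), np.asarray(Z)))
# each (Y, X, Z) triple plays the role of one Unit above
```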

@dengemann (Contributor, author)

Yo, this sounds like the first concrete project: adding a modern interface and establishing tests. Wdyt?

@josef-pkt (Member)

Yes, especially the tests. I made some adjustments based on vague references and trial and error. It looks like it works well in my examples, but no guarantee.

For a nice interface, it would be good if you could keep it somewhat separated from the implementation. We can reuse it for other models, and we can change the implementation of OneWayMixed if we want to make it more "typical" statsmodels.

@dengemann (Contributor, author)

Let's first see whether it reproduces the lmer / aov examples.

> For a nice interface, it would be good if you could keep it somewhat separated from the implementation.

This sounds like a private function for the computation might be what we need.

@josef-pkt (Member)

@dengemann Given the discussions in the documentation improvement issues, two related issues.

Are you familiar with:

Sphericity tests: especially for ANOVA-style repeated measures, there is a lot of emphasis on sphericity tests and df corrections (SPSS, SAS).

Sandwich robust covariance: one alternative to assuming a correlation structure would be to use robust covariance estimators for inference (shouldn't be difficult since we have cluster-robust covariance estimators already).

(I never figured out how these two are related; it looks to me like they both go after the idea of correcting the inference, but I didn't see any overlap in the literature.)

@dengemann (Contributor, author)

> Sphericity tests: especially for ANOVA-style repeated measures, there is a lot of emphasis on sphericity tests and df corrections (SPSS, SAS).

Yo, this seems to be a no-brainer (commonly observable in the medical / psychological world) of which many people don't know: sphericity (an assumption about variance equality in the covariance matrix) actually does not become relevant if factor levels are 2 per factor, see some textbooks (I don't remember which right now) or:

http://en.wikipedia.org/wiki/Mauchly's_sphericity_test

You would still see many papers with 2x2 designs reporting sphericity stats, etc.
If sphericity matters and is violated, there are correction procedures such as Greenhouse-Geisser or Huynh-Feldt to estimate the real degrees of freedom (which get messed up in case sphericity is violated).

> Sandwich robust covariance: one alternative to assuming a correlation structure would be to use robust covariance estimators for inference.

Not familiar with those guys...

> (I never figured out how these two are related; it looks to me like they both go after the idea of correcting the inference, but I didn't see any overlap in the literature.)

But it seems they might integrate nicely in cases of sphericity violation.

@dengemann (Contributor, author)

Sorry, only 2 factor levels, to be precise.

@josef-pkt (Member)

sphericity: I didn't know it's not needed for two factor levels, I never read the small print.

robust cov estimators: The basic idea is that the OLS parameter estimates are consistent even if homoskedasticity and the absence of within correlation are violated, but the OLS estimate of cov_params is wrong. Robust cov estimators calculate a cov_params that is robust to this violation of uncorrelated and identical error terms.

http://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors
also mentioned here: http://en.wikipedia.org/wiki/Generalized_estimating_equation
for the cluster case (the individual would be the cluster):
http://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/

I don't know if there is a special case if there are only two factor levels (never heard about that either).
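For illustration, a small sketch of the cluster-robust idea on simulated data (using the cov_type option of current statsmodels; the data and names are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
n_subj, n_rep = 30, 8
groups = np.repeat(np.arange(n_subj), n_rep)
x = rng.randn(n_subj * n_rep)
u = np.repeat(rng.randn(n_subj), n_rep)      # subject-level error -> within correlation
y = 1.0 + 0.5 * x + u + rng.randn(n_subj * n_rep)

X = sm.add_constant(x)
# OLS point estimates stay the same; the sandwich (cluster-robust) covariance
# corrects the standard errors for the within-subject correlation
res = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})
print(res.bse)
```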

@josef-pkt (Member)

I made lots of starts on this. I translated a matlab module that looks like it covers quite a lot, but I was never sure whether I got the pieces right (and I guess I didn't get all the details).

The problem I have is that I have a reasonably good background in panel data analysis in econometrics, but not in the statistics approach to it. I still haven't really figured out REML, what's noise and what's the real underlying structure.

My impression is that unless you have a good overview over the models, or a very good reference, it's better to start with one specific model, get that to work, and then expand once the structure is clearer. (At least that would apply to me.)
And having good test cases to reverse engineer the results makes life much easier.

@dengemann (Contributor, author)

This makes sense, I fully agree with what you say, let's go specific. We should try to find one good reference and one nice dataset with sufficient observations (long panel) and at least two repeated factors (more or less balanced).

> I still haven't really figured out REML, what's noise and what's the real underlying structure.

I'm new to this too, still getting into it. Just btw.: don't we already have REML in sm?

> And having good test cases to reverse engineer the results makes life much easier.

:-)

@josef-pkt (Member)

We don't have any REML in statsmodels, at least not by that name, except in a few parts of OneWayMixed.

I tried a few more ways with OneWayMixed but didn't get any closer (but managed to get a segfault with linalg).
I might remember the underlying model of OneWayMixed wrongly. I need to look at the reference again, but I guess now that it includes correlation across random effects.

Roughly looking at the difference in the underlying stochastic structure:

factor random effects:
y = x*beta + u_k + v_j + epsilon, where u_k, v_j, epsilon are uncorrelated across time periods/repetitions, and possibly across each other; k could be worker, j could be machine, or u_k + v_j could be the intersection of worker and machine.
Each element of y is an observation.

block structure:

(don't edit in Internet Explorer, it eats half of your text when GitHub insists on silly popups) 🎱

@dengemann (Contributor, author)

> We don't have any REML in statsmodels, at least not by that name, except in a few parts of OneWayMixed.

Btw., I just found this in NiPy: https://github.com/nipy/nipy/blob/master/nipy/modalities/fmri/spm/reml.py
The question is how many cases are covered by this.

> don't edit in Internet Explorer, it eats half of your text when GitHub insists on silly popups

Wrong browser? ^^

> I tried a few more ways with OneWayMixed but didn't get any closer (but managed to get a segfault with linalg).

I have to admit that the data model of OneWayMixed seems a bit unusual to me. Was constructing lists of units the old way of assembling the required data structures? I'm having a hard time getting used to this representation.

@josef-pkt (Member)

Splitting it up into Units is still Jonathan Taylor's original structure. It's a bit unfamiliar, but it has the advantage of keeping the calculation for each unit together. The interface will definitely change to (endog, exog, groups), where groups is the factor defining the units. However, I didn't want to try to change the internal structure until I/we have a test case and unit tests.
As you can see from some of the comments in the code, it's pretty much a direct translation of the reference into code.

Here are two versions I used to try it out yesterday:
https://gist.github.com/josef-pkt/4754200
https://gist.github.com/josef-pkt/4754225 (this one has separate worker-machine units)

I checked the reference again:
Nan Laird, Nicholas Lange, Daniel Stram (1987), "Maximum Likelihood Computations with Repeated Measures: Application of the EM Algorithm", Journal of the American Statistical Association, Vol. 82, No. 397.

As I started to realize yesterday, the current implementation has unconstrained covariances of the random factors (unconstrained D in the article, AFAICR).
Laird et al. mention constrained-D cases, and the partial overlap with rANOVA.

I still don't know why we don't get the random effects in the Machines example; Laird et al. describe one special case where the parameter estimates of the mixed model coincide with OLS (balanced, and ....).

@josef-pkt (Member)

About the reml in nipy:
I didn't know about this, and haven't looked into nipy in some time.
Looks like a nice standalone function to get the variance decomposition. Without much documentation and no reference, it might be a bit of work to reverse engineer what it does.
No separate unit tests for the function, but I didn't check whether it's covered by the unit tests higher up.

If this works or can be made to work as a standalone function, then it would be a nice function to plug into other models.

@josef-pkt (Member)

The reml function triggered my memory a bit more.

On a very abstract level we just have a model y = x beta + u, with u distributed N(0, V(z, theta)), where the vectors and matrices stack all observations; x, z are explanatory variables, and variables with random coefficients appear in both x and z.

We can just use iterated or two-stage GLS or maximum likelihood.
The main problem is that V is (nobs, nobs) in general, which is too large to work with efficiently without any specific structure.
The "game" is then to find algorithms that exploit the specific structure of V and good estimates of the covariance parameters theta. Functions like nipy's reml are part of it.
The other part is to get an efficient whitening of exog and endog in GLS, and additionally the determinant of V for MLE.

My attempt at this for the flat panel case with block structure, as in the repeated measures models, is in "statsmodels\sandbox\panel\panel_short.py".
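To illustrate the "exploit the structure" point for the balanced block case (a sketch, not the panel_short.py implementation): with V = I_N kron Sigma we never need the (nobs, nobs) matrix, only the small (T, T) within-individual Sigma.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def whiten_balanced_panel(Y, X, sigma):
    """Whiten a balanced panel with within-individual covariance sigma.

    Y: (N, T), X: (N, T, k), sigma: (T, T); V = I_N kron sigma.
    """
    L = cholesky(sigma, lower=True)                  # sigma = L @ L.T
    Yw = solve_triangular(L, Y.T, lower=True).T      # L^{-1} y_i for each i
    Xw = np.stack([solve_triangular(L, Xi, lower=True) for Xi in X])
    return Yw, Xw
# stacking Yw/Xw and running plain OLS on them gives the GLS estimate for
# this error structure; the log-determinant needed for MLE is just
# N * log(det(sigma)).
```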

@josef-pkt (Member)

Maybe I have now got you as confused or lost as I sometimes feel. (Is it possible to see the forest and the trees at the same time?)
(My comment in panel_short.py starts with "starting from scratch", trying to get to grips with how to implement this, and finding what could be an alternative to the Units in OneWayMixed.)

As I mentioned before, maybe you want to get started with a model that is close to your use case and with a good reference, and implement that.

@dengemann (Contributor, author)

... just to add another tree to our forest, also NiPy:
https://github.com/nipy/nipy/blob/master/nipy/algorithms/statistics/mixed_effects_stat.py

Thanks for the examples posted. So if I get it right, units are the 'entities' which get repeatedly observed?
I have not yet looked into panel_short, hold on.
What I'm doing now is reanalysing some of my data using lme4 to get a better feeling for this.
Looking at the typical mixed models (MM) analysis style, the problem I see with 'select one model and write that from scratch' is that the common MM method implies comparing different models using likelihood ratio tests. Or did you refer to 'one model' in a more generic way?

@josef-pkt (Member)

The smallest class of models that covers your cases, unless this is the entire forest ;)

Just as an example: if all your explanatory variables in the random effects Z are categorical (no time trend or continuous variables) and you assume the individual random effects are independent of each other (diagonal D in OneWayMixed, if I read it correctly), then you are in the case of what I called factor random structure, which also looks to be the case for the nipy mixed_effects_stat.py, judging from the module docstring.

That's a narrower class than taking various covariance structures in the random effects into account.

In OneWayMixed, Units are the (repeated measures) individuals; one Unit collects all observations for one individual. What constitutes a Unit is given by the assumption on the block structure of the covariance of the error term: correlation within Units, no error correlation across Units.

@josef-pkt (Member)

Just one more (while shutting down open windows and tabs):

http://lme4.r-forge.r-project.org/slides/2011-03-16-Amsterdam/4Theory.pdf page 7
"There are sophisticated and efficient ways of calculating a sparse Cholesky factor, which is a sparse, lower-triangular matrix ..."
That's the numerical missing piece to get an efficient general solution as in lme4.

@dengemann (Contributor, author)

... good you name it. Today I fully read the above-cited 2008 paper, the vignettes linked here, as well as some parts of the lme4 sources (ironically, the C/C++ is easier to read than the R). I had the impression that the Cholesky factorization might be one key ingredient.

@jseabold (Member)

I don't recall all the details (IANAL)... I don't think we could distribute it because it's GPL, but we can certainly use it, if available.

https://github.com/njsmith/scikits-sparse
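For reference, a usage sketch (assuming the CHOLMOD wrapper's API as in current scikit-sparse; the import path may be scikits.sparse.cholmod in older versions, and I haven't checked all platforms):

```python
import numpy as np
from scipy import sparse
from sksparse.cholmod import cholesky   # wraps CHOLMOD

# factor a sparse SPD matrix once, then reuse the factor for solves and
# the log-determinant (the piece needed for (RE)ML)
A = sparse.csc_matrix(np.array([[4.0, 1.0, 0.0],
                                [1.0, 3.0, 0.0],
                                [0.0, 0.0, 2.0]]))
factor = cholesky(A)
x = factor(np.ones(3))        # solves A x = b
logdet = factor.logdet()
```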

@dengemann (Contributor, author)

it's BSD but has GPL ingredients. A weak dependency could work. What about scipy.linalg.cholesky?

@josef-pkt (Member)

I think GPL infection is pretty strict; we would still need to have an alternative implementation. It's not LGPL, which would allow linking against it.
IIRC, scikits-sparse is not supported on Windows.

scipy.linalg is all (or almost all) dense. Sparse support has improved in scipy, especially since scikit-learn, and I haven't kept up with it, but I think there was nothing in scipy.sparse that could be used directly.

@dengemann (Contributor, author)

I think in scikit-learn a Cholesky factorization is implemented in several places.

!git grep Cholesky

@josef-pkt (Member)

It might not be so difficult to implement it directly since it's for a specific use case. But I never looked very closely at it.

Once I tried this for a different kind of mixed model, with several random factors and not the repeated-measures block structure. In that case the inverse covariance matrix was not sparse and I gave up on the generic approach.
For cluster-robust standard errors, I have a version that uses separate sparse covariance matrices for each factor/group variable and it worked pretty well, but that's a much simpler problem.

@jseabold (Member)

I thought scikit-learn might have sparse cholesky, but I didn't see it.

@dengemann (Contributor, author)

It's more context-related, no API as such. Implemented on demand, if you wish.

@josef-pkt (Member)

before you go hunting for everything Cholesky ;)
I think what we need is cholsigmainv, the cholesky decomposition of the inverse covariance matrix, or of a dot operator defined with it.
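Just to spell that out (a toy check, not the GLS code): cholsigmainv is a matrix C with C'C = Sigma^{-1}, so that pre-multiplying endog and exog by C whitens the errors.

```python
import numpy as np

sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
cholsigmainv = np.linalg.cholesky(np.linalg.inv(sigma)).T
# whitening works because cholsigmainv @ sigma @ cholsigmainv.T == I
assert np.allclose(cholsigmainv @ sigma @ cholsigmainv.T, np.eye(2))
```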

@dengemann (Contributor, author)

Short update on my lmer source studies:

  • After some initial struggle this seems more straightforward than initially assumed. You have to buy some R peculiarities like mc <- match.call(), which is just a way of packaging and dispatching arguments from the called function to another, e.g. private, function, roughly similar to kwargs.
  • The interesting functions in lmer.R are then lmerFrame, lmerFactorList and lmer_finalize, which takes the results of the former two to construct a 'mer' object by calling new. This is also where the interesting interfacing with the C code takes place.
  • To interactively explore the matrices etc. of the 'mer' instance returned from lmer (via lmer_finalize), use @ as the membership operator; I did not know that...

@dengemann (Contributor, author)

> before you go hunting for everything Cholesky ;)

and

> just one more (while shutting down open windows and tabs)

This seems to be a nice blog post:
http://quantstart.com/articles/Cholesky-Decomposition-in-Python-and-NumPy

@josef-pkt (Member)

To add one more tree, since I just saw a reference:

What I called factor random effects usually appears in the literature as variance components mixed or random effects models.
The solution for known variances can be written so that it has normal equations in (mostly) the same way as stochastic restrictions in a linear model. That was my "classical" way to impose prior assumptions/restrictions, similar to hierarchical models in Bayesian analysis. (I haven't yet worked on estimating the (co)variances for this, as a two-step or iterative estimator.)

@dengemann (Contributor, author)

> What I called factor random effects usually appears in the literature as variance components mixed or random effects models.

Ok, good to have that link, this binds a few trees ;-)

@yl565 (Contributor) commented Sep 26, 2016

Is anyone working on this currently? I happen to work a lot with mixed models with correlated errors and would like to give it a shot.

@kshedden (Contributor)

Yes, this was merged a few years ago and is available as the MixedLM class. Please try it out and open an issue if needed or ask a question at https://groups.google.com/forum/#!forum/pystatsmodels.

@yl565 (Contributor) commented Sep 26, 2016

Hmm... maybe I missed something, but I thought the current MixedLM supports random effects but does not allow specifying the error covariance structure, like the correlation argument in R's lme:

lme(Y ~ X, random=~1 | id, correlation=corCompSymm(form=~1|id), method="ML", data=data)

@kshedden (Contributor)

I don't know the lme syntax very well, but I believe the model you have specified using corCompSymm should be equivalent to a random intercept model, which we can certainly handle. But we cannot handle time-series like dependence structures such as AR. I would like to do that at some point but no time now (it is not in lme4 either AFAIK).

Our implementation is more similar to lme4 than to lme/nlme. Again, AFAIK we can handle the full range of model specifications supported by lme4 (and even a few that I do not know how to specify in lme4). Our vc syntax is quite general albeit somewhat harder to use. Also, our code will run moderately/dramatically slower depending on the specifics.
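For readers landing here now, the random-intercept case can be written roughly like this with the MixedLM formula interface (toy data; column names are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy data: random intercept per id (the compound-symmetry case above)
rng = np.random.RandomState(0)
n_id, n_rep = 30, 6
data = pd.DataFrame({
    "id": np.repeat(np.arange(n_id), n_rep),
    "X": rng.randn(n_id * n_rep),
})
data["Y"] = (1.0 + 0.5 * data["X"]
             + np.repeat(rng.randn(n_id), n_rep)   # subject-level effect
             + rng.randn(n_id * n_rep))

md = smf.mixedlm("Y ~ X", data, groups=data["id"])
mdf = md.fit(reml=False)   # ML, comparable to method="ML" in lme
print(mdf.summary())
```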
