ENH: mixed effects / repeated measures #646
Nothing going on at the moment, I haven't worked on this since #145. When I worked on this, I didn't have a specific use case in mind, so some of the code is all over the place, preparing for different models in repeated measures, mixed effects, panel data, and GLS with estimated error structure. Much of the basics/infrastructure for repeated measures in the ANOVA style is missing; however, there is quite a lot of code to estimate with a non-identity error structure, heading in the direction of GEE. It wasn't on my schedule to work on this, and nobody else is working on it, so give it a shot. Questions that will affect how to efficiently implement estimation for specific use cases relate to the structure of your data. For the balanced case, there are also multivariate models in the sysreg pull request (GSOC). This area is a bit large, but we don't have to worry about all the different versions. Have fun and code :)
Yay,
On Tue, Feb 5, 2013 at 3:34 PM, Josef Perktold notifications@github.com wrote:
The second use case is the more critical one and it is even clearer related
Anyways, having a model class that is able to reproduce 2x2 repeated models
For the balanced case, there are also multivariate models in the sysreg
Yes, definitely, as I indicated I would like something that given the right
sounds interesting
just two comments: large parts of the estimation in #145 were oriented towards short and wide panels (including unbalanced for some parts). We will see how much can also be applied to long panels.
a general idea for very large datasets:
On Tue, Feb 5, 2013 at 4:44 PM, Josef Perktold notifications@github.com wrote:
I think it would be helpful if you start a SMEP on the Wiki. If you have specific R functions and options (like aov) and examples, then that would be useful so we can get a more specific picture of where you want to go. I still have only a vague idea of what's required, and that mostly on the estimation parts. Given that we have formulas with patsy now, and Skipper's ANOVA example, it might be good to start from there and then see which parts or models are missing. It could be useful to have an overview before getting lost in the "jungle". Specifically, do you need one of the GLM distribution families, or is it mainly linear models?
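As a concrete starting point for the patsy route: a small sketch of building fixed-effects design matrices for a 2x2 within-subject layout. The data frame and factor names here are made up for illustration:

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# hypothetical toy data: 2 subjects, 2x2 within-subject factors a and b
df = pd.DataFrame({
    "y": np.random.default_rng(0).normal(size=8),
    "a": ["lo", "lo", "hi", "hi"] * 2,
    "b": ["x", "y"] * 4,
    "subject": [1, 1, 1, 1, 2, 2, 2, 2],
})

# full-factorial fixed-effects design: intercept, main effects, interaction
y, X = dmatrices("y ~ C(a) * C(b)", df, return_type="dataframe")
```

The missing repeated-measures piece is then the error structure by subject, not the design matrix itself.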
On Tue, Feb 5, 2013 at 5:29 PM, Josef Perktold notifications@github.com wrote:
Specifically, do you need one of the GLM distribution families, or is it
just remembered: UCLA and some other university websites often have good examples
that's one of my favorites, I would often give it to students as an example how
On Tue, Feb 5, 2013 at 5:50 PM, Josef Perktold notifications@github.com wrote:
this is another one which is quite instructive for use case 1 (behavioral
http://personality-project.org/r/r.anova.html
On Tue, Feb 5, 2013 at 5:53 PM, Denis-Alexander Engemann <
... and finally:
sorry for the incremental updates.
On Tue, Feb 5, 2013 at 5:57 PM, Denis-Alexander Engemann <
examples or descriptions like these were also some of the references that I used for short panels; for example, I got a start on correlation structures to use with short panels here:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/panel/correlation_structures.py
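For a sense of what that module is after, here is a minimal numpy sketch of two of the classic short-panel correlation structures, compound symmetry and AR(1). Function names here are illustrative, not the sandbox API:

```python
import numpy as np

def corr_equi(nobs, rho):
    """Compound-symmetry (equicorrelated) matrix: 1 on the diagonal, rho off it."""
    c = rho * np.ones((nobs, nobs))
    np.fill_diagonal(c, 1.0)
    return c

def corr_ar1(nobs, rho):
    """AR(1) correlation: rho ** |i - j|."""
    idx = np.arange(nobs)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

With a short panel, each structure needs only one parameter, which is what makes estimation feasible with few observations per individual.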
I think I found one.
On Tue, Feb 5, 2013 at 5:59 PM, Josef Perktold notifications@github.com wrote:
unfortunately it misses real experimental conditions, let's see what else
On Tue, Feb 5, 2013 at 6:03 PM, Denis-Alexander Engemann <
a partial map to the Jungle:
The two main current classes for this are OneWayMixed and ShortPanelGLS; some example scripts are there. Both models assume we have a block correlation structure by individual (within) but not across individuals.
OneWayMixed currently has a refactoring bug, uses the EM algorithm, and is mainly a cleaned-up version of the original model from when stats.models was in scipy.
In both cases, we need enough individuals to estimate the correlation parameters across individuals, or we need to assume a simple correlation structure so we don't need to estimate many parameters for it.
OneWayMixed also allows for random effects, but I haven't found a direct R analog to write tests against.
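The "block correlation by individual, none across individuals" assumption just says the overall error covariance is block diagonal, with one block per individual. A quick sketch with made-up blocks:

```python
import numpy as np
from scipy.linalg import block_diag

# within-individual covariance blocks (made-up numbers); zero across individuals
V1 = np.array([[1.0, 0.5],
               [0.5, 1.0]])                  # individual 1: 2 observations
V2 = np.array([[1.0, 0.3, 0.3],
               [0.3, 1.0, 0.3],
               [0.3, 0.3, 1.0]])             # individual 2: 3 observations
V = block_diag(V1, V2)                       # full (5 x 5) error covariance
```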
Thanks heaps!
On Tue, Feb 5, 2013 at 6:20 PM, Josef Perktold notifications@github.com wrote:
which is not a big deal for simple cases with 2 factor levels...
So what's missing here basically is general support for k-way —
FYI, I just added a bunch of new data from the psych and HSAUR packages to the Rdatasets archive. For example, this ANOVA tutorial uses weightgain, foster, water, skulls from the HSAUR package:
http://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_analysis_of_variance.pdf
CSV datasets listed here:
Yay, this is neat!
On Tue, Feb 5, 2013 at 6:31 PM, Vincent Arel-Bundock <
that depends on the case, for unstructured the number of parameters is
IIRC, there are different algorithms used, close enough to get similar
Interaction effects can be specified as fixed effects, the assumption is on
OneWayMixed allows for explanatory variables in the random effects part.
On Tue, Feb 5, 2013 at 6:57 PM, Josef Perktold notifications@github.com wrote:
yes, got it.
Ok, we're getting closer. In most cases my error term includes the
Y ~ (A * B) + Error(Subject / (A * B))
That sounds interesting.
One more comment before you go to play
Y ~ (A * B) + Error(Subject / (A * B))
sounds like what OneWayMixed might be doing
I didn't or don't understand enough R to come up with examples like these, so maybe I missed the comparison case in R because it's all hidden inside the formulas.
On Tue, Feb 5, 2013 at 7:42 PM, Josef Perktold notifications@github.com wrote:
So how would this kind of model look using OneWayMixed?
On Tue, Feb 5, 2013 at 7:57 PM, Denis-Alexander Engemann <
The code is still in the sandbox for a reason :) There is no nice user interface yet, and it predates the addition of formulas. If you have categorical variables, then you could use patsy to create the design matrices.
get y, X, Z for each individual and create a list of Units
pseudo code:
for all individuals:
then call:
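A minimal sketch of that per-individual split, assuming flat arrays y, X, Z plus a group label per row. The actual sandbox Unit class wraps these arrays; the plain tuple here just stands in for it:

```python
import numpy as np

def make_units(y, X, Z, groups):
    """Split flat (y, X, Z) arrays into one (y_i, X_i, Z_i) piece per individual.

    A stand-in for building the list of Unit objects OneWayMixed expects;
    this only shows the per-individual split, not the estimation.
    """
    units = []
    for g in np.unique(groups):
        mask = groups == g
        units.append((y[mask], X[mask], Z[mask]))
    return units

# tiny example: 2 individuals, 3 observations each
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 3)
y = rng.normal(size=6)
X = np.column_stack([np.ones(6), rng.normal(size=6)])  # fixed effects design
Z = np.ones((6, 1))                                    # random intercept only
units = make_units(y, X, Z, groups)
```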
yo, this sounds like the first concrete project. adding a modern interface and establishing tests. Wdyt?
On 05.02.2013, at 20:36, Josef Perktold notifications@github.com wrote:
Yes, especially the tests. I made some adjustments based on vague references and trial and error. It looks like it works well in my examples, but no guarantee. For a nice interface, it would be good if you could keep it somewhat separated from the implementation. We can reuse it for other models, and we can change the implementation of OneWayMixed if we want to make it more "typical" statsmodels.
Let's first see whether it reproduces the lmer / aov examples,
On 05.02.2013, at 20:54, Josef Perktold notifications@github.com wrote:
this sounds like a private function for the computation might be what we need.
@dengemann Given the discussions in the documentation improvement issues, two related issues. Are you familiar with:
Sphericity tests: especially for ANOVA-style repeated measures, there is a lot of emphasis on sphericity tests and df corrections (SPSS, SAS).
sandwich robust covariance: one alternative to assuming a correlation structure would be to use robust covariance estimators for inference (shouldn't be difficult since we have cluster robust covariance estimators already).
(I never figured out how these two are related; it looks to me that they both go after the idea of correcting the inference, but I didn't see any overlap in the literature.)
On Fri, Feb 8, 2013 at 1:49 PM, Josef Perktold notifications@github.com wrote:
http://en.wikipedia.org/wiki/Mauchly's_sphericity_test
You would see many papers implying 2x2 designs reporting sphericity stats,
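For reference, the W statistic from the linked article can be sketched in a few lines of numpy: with C an orthonormal contrast matrix (rows orthogonal to the constant), S the sample covariance of the p repeated measures, and T = C S Cᵀ, then W = det(T) / (tr(T)/(p-1))^(p-1). This is only a sketch of the statistic itself, without the chi-square approximation used for the actual test:

```python
import numpy as np

def mauchly_w(data):
    """Mauchly's W from an (n_subjects, p_conditions) array.

    W close to 1 means the covariance of the orthonormalized contrasts
    is close to spherical. Sketch only; no p-value computed here.
    """
    n, p = data.shape
    S = np.cov(data, rowvar=False)
    # successive-difference contrasts, then orthonormalize the rows via QR
    M = np.eye(p)[:-1] - np.eye(p)[1:]
    C = np.linalg.qr(M.T)[0].T
    T = C @ S @ C.T
    return np.linalg.det(T) / (np.trace(T) / (p - 1)) ** (p - 1)
```

Note that W is identically 1 for p = 2, consistent with the later remark in this thread that sphericity is not an issue with only two factor levels.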
On Fri, Feb 8, 2013 at 2:13 PM, Denis-Alexander Engemann <
sorry, only 2-factor levels, to be precise
sphericity: I didn't know it's not needed for two factors, I never read the small print.
robust cov estimators: The basic idea is that the parameter estimates with OLS are consistent even if homoscedasticity and no within-correlation are violated, but the OLS estimate of cov_params is wrong. Robust cov estimators calculate a correct cov_params that is robust to this violation of uncorrelated and identical error terms.
http://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors
I don't know if there is a special case if there are only two factor levels (never heard about that either).
I made lots of starts on this. I translated a matlab module that looks like it covers quite a lot, but I was never sure whether I got the pieces right (and I guess I didn't get all the details). The problem I have is that I have a reasonably good background in panel data analysis in econometrics, but not in the statistics approach to it. I still haven't really figured out REML, what's noise and what's the real underlying structure. My impression is that unless you have a good overview of the models, or a very good reference, it's better to start with one specific model, get that to work, and then expand once the structure is clearer. (At least that would apply to me.)
this makes sense, I fully agree with what you say, let's go specific. we should try to find one good reference and one nice dataset with sufficient observations (long panel) and at least two repeated factors (more or less balanced).
On 11.02.2013, at 01:47, Josef Perktold notifications@github.com wrote:
we don't have any REML in statsmodels, at least not by that name, except what is in a few parts of OneWayMixed. I tried a few more ways with OneWayMixed but didn't get any closer (but managed to get a segfault with linalg).
roughly looking at the difference in the underlying stochastic structure
factor random effects: block structure
don't edit in internet explorer, it eats half of your text when github insists on silly popups
Btw. I just found this in NiPy https://github.com/nipy/nipy/blob/master/nipy/modalities/fmri/spm/reml.py
wrong browser? ^^
I have to admit that the data model of OneWayMixed seems a bit exceptional to me. Was constructing lists of units the old way of assembling the required data structures? I'm having a hard time getting used to this representation.
Splitting it up into Units is still Jonathan Taylor's original structure. It's a bit unfamiliar, but it has the advantage of keeping the calculation for each unit together. The interface will definitely change to (endog, exog, groups), where groups is the factor for the units. However, I didn't want to try to change the internal structure until I/we have a test case and unit tests.
Here are two versions I used to try it out yesterday.
I checked the reference again. As I started to realize yesterday, the current implementation has unconstrained covariances of random factors. I still don't know why we don't get the random effects in the Machines example; Laird et al. describe one special case where the parameter estimates of the mixed model coincide with OLS (balanced, and ....).
about the reml in nipy: If this works or can be made to work as a standalone function, then it would be a nice function to plug into other models.
the reml function triggered my memory a bit more. On a very abstract level we just have a model y = X beta + u, with u distributed N(0, V(z, theta)). We can just use iterated or two-stage GLS or maximum likelihood.
my attempt at this for the flat panel case with block structure as in the repeated measures models is in "statsmodels\sandbox\panel\panel_short.py"
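The two-stage GLS idea for that model can be sketched end to end: OLS first, estimate a compound-symmetry V(theta) from the residuals, then GLS with the block-diagonal V. Simulated data; this illustrates the idea only, it is not the panel_short.py implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per = 50, 4
n = n_groups * n_per
groups = np.repeat(np.arange(n_groups), n_per)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# equicorrelated errors within each group: y = X beta + u, u ~ N(0, V)
u = 0.8 * rng.normal(size=n_groups)[groups] + rng.normal(size=n)
y = X @ np.array([1.0, 2.0]) + u

# stage 1: OLS, then estimate sigma2 and the within-group correlation rho
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols
sigma2 = resid @ resid / n
r = resid.reshape(n_groups, n_per)          # rows are groups (data is sorted)
s = r.sum(axis=1)
off_diag_mean = ((s**2 - (r**2).sum(axis=1)).sum()
                 / (n_groups * n_per * (n_per - 1)))
rho = off_diag_mean / sigma2

# stage 2: GLS with the estimated block-diagonal V (one block per group)
Vi = sigma2 * ((1 - rho) * np.eye(n_per) + rho * np.ones((n_per, n_per)))
Vinv = np.linalg.inv(Vi)
XtVX = sum(X[groups == g].T @ Vinv @ X[groups == g] for g in range(n_groups))
XtVy = sum(X[groups == g].T @ Vinv @ y[groups == g] for g in range(n_groups))
beta_gls = np.linalg.solve(XtVX, XtVy)
```

Iterating the two stages until rho stabilizes gives iterated GLS; replacing stage 1 by maximizing the normal likelihood in (beta, theta) gives the ML route.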
Maybe I got you now as confused or lost as I sometimes feel. (Is it possible to see the forest and the trees at the same time?) As I mentioned before, maybe you want to get started with a model that is close to your use case and has a good reference, and implement that.
... just to add another tree to our forest, also NiPy:
Thanks for the examples posted. So if I get it right, units are the 'entities' which get repeatedly observed?
The smallest class of models that covers your cases, unless this is the entire forest ;)
just as an example: If all your explanatory variables in the random effects Z are categorical (no time trend or continuous variables) and you assume the individual random effects are independent from each other (diagonal D in OneWayMixed, if I read it correctly), then you are in the case of what I called factor random structure, which also looks to be the case for the nipy mixed_effects_stat.py, from the module docstring. That's a narrower class than taking various covariance structures in the random effects into account. In OneWayMixed,
just one more (while shutting down open windows and tabs)
http://lme4.r-forge.r-project.org/slides/2011-03-16-Amsterdam/4Theory.pdf
page 7
... good you name it. Today I fully read the above-cited 2008 paper, the linked vignettes here, as well as some parts of the lme4 sources (ironically the C/C++ is easier to read than the R). I had the impression that the Cholesky factorization might be one key ingredient.
I don't recall all the details (IANAL)... I don't think we could distribute it because it's GPL, but we can certainly use it, if available.
it's BSD but has GPL ingredients. A weak dependency could work. What about |
I think GPL infection is pretty strict, we still need to have an alternative implementation. It's not LGPL, where linking against is allowed. scipy.linalg is all (or almost all) dense. Sparse support has improved in scipy, especially since scikit-learn, and I haven't kept up with it, but there was nothing in scipy.sparse that could be used directly.
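For the dense route that scipy.linalg does cover, the usual pattern is to factor the covariance once with a Cholesky decomposition and reuse the factor for solves:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# a symmetric positive definite matrix standing in for a covariance V
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
V = A @ A.T + 5.0 * np.eye(5)

b = rng.normal(size=5)
c, low = cho_factor(V)          # factor once...
x = cho_solve((c, low), b)      # ...reuse for each right-hand side
```

The sparse analog of this (CHOLMOD-style) is what the GPL discussion above is about; scipy itself only offers the dense version.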
I think in scikit-learn a Cholesky factorization is implemented in
On Mon, Feb 11, 2013 at 7:31 PM, Josef Perktold notifications@github.com wrote:
It might not be so difficult to implement it directly since it's for a specific use case. But I never looked very closely at it. Once, I tried for a different kind of mixed model, with several random factors and not the repeated measures block structure. In that case the inverse covariance matrix was not sparse and I gave up on the generic approach.
I thought scikit-learn might have a sparse cholesky, but I didn't see it.
it's more context-related, no API or so. Implemented on demand, if you wish.
On 11.02.2013, at 20:13, Skipper Seabold notifications@github.com wrote:
before you go hunting for everything Cholesky ;)
Short update on studying the lmer sources:
and
This seems to be a nice blog: |
to add one more tree, since I just saw a reference: what I called factor random effects usually appear in the literature as variance components, mixed, or random effects models.
On 14.02.2013, at 19:24, Josef Perktold notifications@github.com wrote:
ok, good to have that link, this binds a few trees ;-)
Is there anyone working on this currently? I happen to work a lot with mixed models with correlated errors and would like to give it a shot.
Yes, this was merged a few years ago and is available as the MixedLM class. Please try it out and open an issue if needed, or ask a question at https://groups.google.com/forum/#!forum/pystatsmodels.
Hmm... Maybe I missed something, but I thought the current MixedLM supports random effects but does not allow specifying an error covariance structure, like
lme(Y ~ X, random=~1 | id, correlation=corCompSymm(form=~1|id), method="ML", data=data)
I don't know the lme syntax very well, but I believe the model you have specified using corCompSymm should be equivalent to a random intercept model, which we can certainly handle. But we cannot handle time-series-like dependence structures such as AR. I would like to do that at some point but have no time now (it is not in lme4 either, AFAIK). Our implementation is more similar to lme4 than to lme/nlme. Again, AFAIK we can handle the full range of model specifications supported by lme4 (and even a few that I do not know how to specify in lme4). Our vc syntax is quite general, albeit somewhat harder to use. Also, our code will run moderately to dramatically slower depending on the specifics.
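The random-intercept equivalence mentioned here is what MixedLM fits with a plain groups argument; a minimal example on simulated data (variable names made up to mirror the lme call above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a random-intercept (compound symmetry within group) data set
rng = np.random.default_rng(0)
n_groups, n_per = 40, 5
ids = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(size=n_groups * n_per)
y = 1.0 + 2.0 * x + rng.normal(size=n_groups)[ids] + rng.normal(size=len(ids))
df = pd.DataFrame({"Y": y, "X": x, "id": ids})

# one random intercept per id, structurally matching corCompSymm within id
res = smf.mixedlm("Y ~ X", df, groups=df["id"]).fit()
```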
How is this evolving at the moment? Looking at the sandbox there does not seem to be too much change.
I've got a couple of use cases for which I ideally need to report 'repeated measures' analyses.
Also for quite some time I've been thinking about implementing a general solution applicable to cognitive neuroscience / physiological data.
So typically it should handle scenarios like: Y ~ within_a x within_b x between_j. Y can be button press latencies, error rates, etc., or brain responses. In that case Y has to be estimated at location l, time t (and frequency f).
I might give this a shot if no one is deeply devoted to this during the next weeks or so.