Compositional Data Analysis modules #3763
It doesn't really fit into So, I think we should start a new The main thing we will eventually have to look out for is circular imports. But this is a bit easier for us because of our structure of empty I'm leaving for Europe in a few days and don't know when I will have time to look at it. I did enough background reading last time that I don't expect problems reviewing and merging this, although I won't try to understand all the details. |
unit tests fail because of a typo in "statsmodels"
I was still thinking about the directory name: I think the adjective "compositional" instead of "compositions" would be better. It sounds more like compositional data or models for compositional data. (e.g. circular versus circles) |
I think the name change makes sense. It'll also be good to have this not be confused with function compositions. The imports have been fixed - let's see how that compiles in travis. |
A semi-random thought: I expect that we need to do some generic refactoring for better support of multivariate models in late summer or early fall. After that it would be a good time to see what elements we need beyond multivariate linear/Gaussian models to support the transformed endog y. |
That's actually a good question - I'm not entirely sure what that would involve. If I had to hazard a guess, I would think that the log-likelihood for the normal distribution can be completely recycled, since the But I'm not entirely sure - and I'm having trouble finding literature to back this up. |
Also, it looks like the tests are failing on python=2.7. I'm having a bit of trouble reproducing these errors on my python=2.7 build. Is there an easy way to list the errors in the travis build? |
At the bottom of the test log is the link to the "raw log" file. We use verbose mode, which produces test output that is too long for the Travis display. Maybe you forgot a
|
About the likelihood: One thing that wasn't clear to me is whether we use the likelihood for the transformed variable (which they might call the Aitchison measure) or whether we want the likelihood for the original data, which would need the determinant of the Jacobian of the transformation. If we want to compare models with different transformations, then we would need the likelihood in the original endog y space. But this can also wait until we have some explicit models that use these transformations. In some applications that I have read about, they just use standard multivariate methods designed for the normal distribution case. |
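A sketch of the distinction made above, written as the usual change-of-variables formula. The symbols are assumptions introduced here for illustration: $T$ is the transformation applied to endog (e.g. alr or ilr), $f_z$ is the density of the transformed variable, and $J_T$ is the Jacobian of $T$ with respect to the free components of $y_i$.

```latex
\log L_y(\theta)
  = \sum_{i=1}^{n} \Big[ \log f_z\big(T(y_i);\, \theta\big)
  + \log \big|\det J_T(y_i)\big| \Big]
```

The Jacobian term does not depend on $\theta$, so for a fixed transformation it leaves the parameter estimates unchanged. It only matters when likelihoods from different transformations are compared in the original y space.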
I canceled and restarted two of the jobs on travis because they seemed to be stuck in the waiting line. All green now. The one appveyor fail is the test run getting stuck in statespace models, unrelated to this PR. |
For ilr normal models, it doesn't look too bad to calculate the Jacobian Not sure what exactly this will involve when fitting this into the existing statsmodels framework - but I don't see any reason why we can't do this in another PR in the near future. |
@josef-pkt it looks like the statespace module is still failing - but there are still other PRs in statsmodels that are passing. Any idea what is going on? |
The statespace unit tests have functions that randomly hang: one of the three appveyor python builds gets stuck in about three quarters of the test runs (rough guess). It's somewhere inside linear algebra packages or in garbage collection, but it's essentially impossible to reproduce locally and we have no clear idea what might be the problem. We ignore those for now, and will eventually skip them on Windows. |
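A minimal numerical sketch of the ilr Jacobian, not the code from this PR. It assumes an orthonormal Helmert-type ilr basis and treats the first D-1 parts as free coordinates, with the last part fixed by the unit-sum constraint. Under those assumptions the log absolute determinant has the closed form `-0.5*log(D) - sum(log(x))`, a standard result for orthonormal ilr bases that is not stated in this thread. The helper names (`helmert_basis`, `ilr`, `numerical_jacobian`) are hypothetical.

```python
import numpy as np

def helmert_basis(D):
    """Orthonormal basis V (D x (D-1)) with columns orthogonal to the ones vector."""
    V = np.zeros((D, D - 1))
    for j in range(1, D):
        V[:j, j - 1] = 1.0 / j
        V[j, j - 1] = -1.0
        V[:, j - 1] *= np.sqrt(j / (j + 1.0))  # normalize the column to unit length
    return V

def ilr(x, V):
    """ilr coordinates of a composition x for the basis V."""
    return np.log(x) @ V

def log_abs_det_jacobian_ilr(x):
    """Closed-form log|det J| of ilr w.r.t. the first D-1 parts of x."""
    D = x.shape[-1]
    return -0.5 * np.log(D) - np.log(x).sum(axis=-1)

def numerical_jacobian(x, V, eps=1e-6):
    """Central-difference Jacobian; the last part absorbs each perturbation."""
    D = x.size
    J = np.empty((D - 1, D - 1))
    for k in range(D - 1):
        step = np.zeros(D)
        step[k] = eps
        step[-1] = -eps  # keep the composition summing to one
        J[:, k] = (ilr(x + step, V) - ilr(x - step, V)) / (2 * eps)
    return J

x = np.array([0.2, 0.3, 0.5])
V = helmert_basis(x.size)
sign, logdet_numeric = np.linalg.slogdet(numerical_jacobian(x, V))
logdet_analytic = log_abs_det_jacobian_ilr(x)
```

Summing `log_abs_det_jacobian_ilr` over observations would give the extra term needed to move a Gaussian log-likelihood from ilr coordinates back to the original data space.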
I was reading partially through this PR (without trying to check/understand the math) |
Awesome! Where is a good place to get started with that? Can definitely boot up another PR to handle the link functions. While we are on the topic, note that the multinomial logistic regression can be rephrased as a GLM with an alr link function. Similarly, a GLM with an ilr link function should boil down to simplicial linear regression. So maybe the GLM with an ilr link function should behave similarly to |
The connection to MNLogit is what triggered our initial discussion. I think it will be easier to focus initially on the compositional use case for e.g. ilr and alr link classes. Even if the structure and transformation are basically the same, there are differences in the details. I don't have a good enough overview of the common structure and different use cases or model types to be able to tell what a common generic design would look like at the current stage. One difference in the underlying model structure is that GLM, GEE and discrete models have a model for the conditional expectation of y: E(y | x) = link_inverse(x b), while the logistic normal model works with E(link(y) | x) = x b. So, I think what we need is something like ILRLink attached to a MultivariateOLS or VAR to transform endog for the estimation, and backtransform for results as needed e.g. |
@josef-pkt that sounds like a plan. I just pushed in the ilr link function (had to derive the derivative by hand -- but it seems to work). It would be totally awesome to have this linked to VAR! Haven't even considered that. It would also be awesome to have other multivariate regression methods with dependent responses, or multivariate response linear mixed effects models (but maybe a discussion for future PRs). Yes, I agree -- the link function should be able to specify the transformations back and forth. The only minor technical issue with this is that the ilr transform can't handle zeros. Question: should we enable the ilr link function for the Gaussian family? That way, we could enable Gaussian distributions on the simplex. Also, are there preferred ways to test this code? I see that the Are there any other gaping holes that need to be addressed in this PR? |
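To illustrate the alr connection mentioned above: the inverse alr transform of a linear predictor gives the multinomial logit probabilities when one category is used as the reference. This is a NumPy-only sketch, not the statsmodels `MNLogit` implementation. Placing the reference category last is an assumption made here for convenience, and `alr` / `alr_inv` are hypothetical helper names.

```python
import numpy as np

def alr(p):
    """Additive log-ratio transform with the last part as reference."""
    p = np.asarray(p, dtype=float)
    return np.log(p[..., :-1] / p[..., -1:])

def alr_inv(z):
    """Inverse alr: append a zero for the reference part, exponentiate, close."""
    z = np.asarray(z, dtype=float)
    zeros = np.zeros(z.shape[:-1] + (1,))
    expz = np.exp(np.concatenate([z, zeros], axis=-1))
    return expz / expz.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # 5 observations, 2 regressors (made-up data)
B = rng.normal(size=(2, 3))   # coefficients for 3 non-reference categories
eta = X @ B                   # linear predictor, one column per non-reference category

probs = alr_inv(eta)          # rows are probability vectors over 4 categories

# The textbook multinomial logit formula for the non-reference categories
mnlogit_probs = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))
```

Here `probs[:, :-1]` matches `mnlogit_probs`, and `alr(probs)` recovers `eta`, which is the sense in which alr acts as the link function.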
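A rough sketch of the second model structure, E(link(y) | x) = x b: transform endog with ilr, run an ordinary multivariate least-squares fit on the coordinates, then back-transform the fitted values. `np.linalg.lstsq` stands in for MultivariateOLS here, and the functions below are hypothetical stand-ins for the proposed ILRLink, not its actual interface. The data are simulated.

```python
import numpy as np

def helmert_basis(D):
    """Orthonormal basis V (D x (D-1)) with columns orthogonal to the ones vector."""
    V = np.zeros((D, D - 1))
    for j in range(1, D):
        V[:j, j - 1] = 1.0 / j
        V[j, j - 1] = -1.0
        V[:, j - 1] *= np.sqrt(j / (j + 1.0))
    return V

def closure(x):
    """Rescale rows to sum to one."""
    return x / x.sum(axis=-1, keepdims=True)

def ilr(y, V):
    """Map compositions (rows of y) to D-1 unconstrained coordinates."""
    return np.log(y) @ V

def ilr_inv(z, V):
    """Map ilr coordinates back to compositions on the simplex."""
    return closure(np.exp(z @ V.T))

# Simulate compositional endog whose ilr coordinates are linear in x
rng = np.random.default_rng(0)
n, D = 200, 3
V = helmert_basis(D)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B_true = np.array([[0.5, -0.2],
                   [1.0, 0.3]])            # shape (n_regressors, D - 1)
Z = X @ B_true + 0.1 * rng.normal(size=(n, D - 1))
Y = ilr_inv(Z, V)                          # observed compositions

# Estimation happens entirely in ilr space
B_hat, *_ = np.linalg.lstsq(X, ilr(Y, V), rcond=None)

# Back-transform fitted coordinates to the simplex for reporting
Y_fit = ilr_inv(X @ B_hat, V)
```

Note that `Y_fit` is the back-transform of the fitted ilr mean, which in general differs from the arithmetic conditional mean E(y | x) that a GLM-style link would target.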
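The zero problem mentioned above comes from the logarithm inside the log-ratio transforms. The sketch below shows the failure and one simple workaround. `replace_zeros` is a hypothetical helper that substitutes a small `delta` and re-closes the composition. It is an assumption for illustration only, not what this PR does, and the choice of `delta` is arbitrary.

```python
import numpy as np

x = np.array([0.5, 0.5, 0.0])   # a composition with a zero part

# log(0) is -inf, so any log-ratio coordinate involving this part is not finite
with np.errstate(divide="ignore"):
    logx = np.log(x)
has_nonfinite = not np.all(np.isfinite(logx))

def replace_zeros(x, delta=1e-6):
    """Hypothetical helper: swap zeros for a small value, then re-close."""
    x = np.where(x == 0, delta, np.asarray(x, dtype=float))
    return x / x.sum(axis=-1, keepdims=True)

x_fixed = replace_zeros(x)
logx_fixed = np.log(x_fixed)     # finite, so ilr/alr can be applied
```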
I have too many different topics, but I'm still thinking on and off how to organize things. another possible renaming suggestion: I looked briefly at multivariate MultivariateOLS which is currently a stub model in support of MANOVA. I think we can create a subclass of it in |
Hi @josef-pkt how do you recommend moving forward with this module? Do you think the blocking factor here is the name of the module, or the In particular, do you still want the link functions in this PR? It isn't clear how applicable these are until the Multivariate Regression objects get refactored. |
The main thing I thought at the end was confirming the name before merging. Link functions and usage can wait for another PR, when we know more about what we need in those cases, like for MultivariateOLS or similar. BTW: I just spent a bit of time on SUR (seemingly unrelated regression, similar to MultivariateOLS but with different regressors for different endog), and I should have a relatively compact version that should also allow for sparse or regularized correlation. (It is still a single estimation problem so cannot be very large.) |
Ok! Knocked out the ilr link function! Concerning the naming, that other thread mentioned having a tools folder. |
Codecov Report
@@ Coverage Diff @@
## master #3763 +/- ##
==========================================
+ Coverage 79% 79.37% +0.37%
==========================================
Files 541 550 +9
Lines 80886 82074 +1188
Branches 9191 9281 +90
==========================================
+ Hits 63901 65146 +1245
+ Misses 14907 14829 -78
- Partials 2078 2099 +21
|
comptools sounds fine to me (I thought also about transformation, but this contains additional "tools" code). I think it can be public, I don't remember if we had a discussion about leading underscores for this. One thing: move the imports from |
Ok - I just copied the imports from |
the api path is Once there are more functions/classes that users want to use with the general api, then the compositional.api will be imported into statsmodels/api.py like other subpackages, so we can also use
recommended path for library use that avoids loading unnecessary other parts is the explicit module path, e.g. (That's the pattern that we have in all main subpackages) |
The never-ending naming game, before names are protected by the backwards compatibility policy: I was thinking now of I think we can merge this soon (included for 0.9). (After a detour to some multivariate statistics in the last two or three weeks, I should be back to merging.) |
This addresses #3560 and #3537
Things that need to be touched up in this PR
@josef-pkt not sure exactly where this module should belong - do you have any better recommendations?