Proportional hazards survival regression model (a.k.a. Cox model) #1312
Thanks Kerby. Actually, we had a student who worked for a while on survival models a few years ago. As far as I can tell, the latest version is in Skipper's fork; it has Kaplan-Meier and some CoxPH and related code. We also have "Accelerated Failure Time (AFT) Model with empirical likelihood inference". I need to see how we can integrate and merge these parts and yours so we have a common base to figure out the missing pieces.
I'll have some comments on this based on my survival branch when I have a chance to come up for air.
Do you think maybe you could dump all the results files into Python classes? Kind of picky, but it'd be easier to look over the substantive changes. FYI, we have some tools for dumping from R to Python. https://github.com/statsmodels/statsmodels/tree/master/tools/R2nparray
FYI, the existing survival work is here. https://github.com/jseabold/statsmodels/tree/survival My plan is to look over yours and then try to rectify the two. Big bonus that you've dog-fooded this code for your own work.
I rewrote the testing files using r2numpy to put the R outputs into Python classes. It was a good suggestion; I think the test script is cleaner now.
Ok, I've found some old code I hadn't committed yet and got mine running again. This code was salvaged from a broken back-up, so it still needs a little work. Right now I have a mostly working KaplanMeier and CoxRegression with a few more results than you have available, but the parameters and like match. I pushed a branch with both implementations for comparison. The biggest thing I was trying to work on was to make the loops over groups general, so we can re-use them elsewhere. What do you think? Is it worth trying to keep this general (and for internal use only, so we can always refactor)? Right now I'm not sure the added complexity is worth it. (I.e., I fear that I will likely remain the only one maintaining it, similar to what's happened with the general tsa stuff.) Basically, this kind of stuff gives you the ability to get groups, iterate over groups, etc. in a general way vs. your hard-coded groups right now. Mainly just asking about the idea rather than the implementation. I'll probably refactor this. https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/base/data.py#L392 Usage in the model https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/sandbox/survival2.py#L778 See the Here's an example usage. You can see what's available in
So here's some concrete questions:
Other than that, things are pretty much the same and we just need to pick an implementation.
I will also be maintaining this and will go over it to understand it well enough (except maybe for the pandas parts, which I don't understand well enough).
So that's a vote for trying to generalize the grouping? You know more of the use cases than I do right now. I'll see about cleaning up the implementation and comparing to what's already available in the panel one.
Skipper, can you open a PR so we get the TravisCI feedback? Automatically converting groups to GroupedData and converting to 2d might cause problems in other code.
It's not ready for a PR. As I mentioned, I wouldn't look at the implementation much right now. It's a mess and has some fundamental problems that need re-thinking in model instantiation with groups. Also, this isn't used anywhere else, so there won't be any problems elsewhere. Or do you mean eventually?
I thought about saying something, but I don't know the code yet. I'm all in favor of making the support code, like group handling, more general so it can be reused across models, panel, GEE, cluster robust, ... I have an idea of where we need it, but I don't have enough of an overview to write a common "specification".
Make a WIP PR: you get TravisCI to check for you, and it will be easier for us to add comments. The change at jseabold/statsmodels@master...phreg-survival-mixed#diff-2e18ce91dae3e3e78a23ba177fdfce58R532 would affect all models, if I interpret the changeset correctly.
I'll bet that the pandas group handling in GroupData is very inefficient, especially if the data array doesn't have a nice structure and we need to recreate the group-specific data arrays each time. That would be expensive in an optimization loop. (*)
(*) We should watch out for memory layout. If we index into the original DataFrame, its memory layout can be very different, and indexing could have to recreate a new array each time.
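To make the concern concrete, here is a minimal sketch (not code from either branch) of one way to avoid rebuilding group-specific arrays inside an optimization loop: sort the rows by group label once, then hand out contiguous slices. The names `split_by_group` and `objective` are hypothetical, and the objective is a toy stand-in for a real likelihood.

```python
import numpy as np


def split_by_group(exog, groups):
    """Build the per-group row blocks of `exog` once, up front.

    Hypothetical helper: rows are sorted by group label so that each
    group occupies one contiguous run of a single sorted array.
    """
    exog = np.ascontiguousarray(exog, dtype=float)
    groups = np.asarray(groups)
    # A stable sort keeps the original row order within each group.
    order = np.argsort(groups, kind="stable")
    sorted_groups = groups[order]
    # Positions where a new group label starts in the sorted labels.
    starts = np.flatnonzero(
        np.r_[True, sorted_groups[1:] != sorted_groups[:-1]]
    )
    sorted_exog = exog[order]
    # np.split returns views into sorted_exog, so nothing is copied here.
    return np.split(sorted_exog, starts[1:])


def objective(params, blocks):
    """Toy objective that loops over the precomputed blocks."""
    return sum(float(np.sum(block @ params)) ** 2 for block in blocks)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    exog = rng.standard_normal((10_000, 3))
    groups = rng.integers(0, 2_000, size=10_000)
    blocks = split_by_group(exog, groups)  # done once
    for step in range(5):  # stands in for optimizer iterations
        value = objective(np.full(3, 0.1 * step), blocks)
    print(len(blocks), value)
```

The grouping cost is paid once, and each iteration only touches contiguous memory. Whether this beats a pandas-based path would still need the timings asked for in this thread.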
I like the grouping code and would be happy to update GEE to use it. I am indifferent to which Cox implementation is used, but will be very happy to see one or the other merged since I use it all the time. One implementation issue is the use of broadcasting in the Hessian calculation. If I follow your implementation correctly, there is an extra layer of looping in the Hessian calculation where mine uses broadcasting. Regarding implementation for time-dependent covariates, your approach of splitting into intervals where the covariates are constant is nice and straightforward. There is also an approach that directly calculates the partial likelihood in terms of covariate histories for other subjects in the risk set. I haven't thought through this enough, but I think there may be a way to optimize your approach by dropping intervals that don't contain events for other subjects.
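As an illustration of the looping versus broadcasting point, the sketch below computes a weighted sum of outer products, the kind of term that appears in a Cox Hessian, in three equivalent ways. It is not taken from either implementation; the function names are hypothetical.

```python
import numpy as np


def weighted_outer_loop(x, w):
    """Sum of w_i * outer(x_i, x_i) with an explicit Python loop."""
    p = x.shape[1]
    out = np.zeros((p, p))
    for xi, wi in zip(x, w):
        out += wi * np.outer(xi, xi)
    return out


def weighted_outer_broadcast(x, w):
    """Same quantity via broadcasting: (n, p, 1) * (n, 1, p) -> (n, p, p)."""
    return (w[:, None, None] * x[:, :, None] * x[:, None, :]).sum(axis=0)


def weighted_outer_matmul(x, w):
    """Same quantity as one matrix product, with no (n, p, p) temporary."""
    return (x * w[:, None]).T @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((200, 4))
    w = rng.uniform(size=200)
    a = weighted_outer_loop(x, w)
    b = weighted_outer_broadcast(x, w)
    c = weighted_outer_matmul(x, w)
    print(np.allclose(a, b), np.allclose(a, c))
```

The broadcast form removes the Python-level loop but allocates an `(n, p, p)` temporary. The matrix-product form avoids both, which matters when a risk set is large.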
I would prefer to see some timings before doing that, especially for a large number of groups (a few thousand). To partially summarize my view on groups:
A question where I don't know enough about pandas: if pandas has/had fast paths for groupby depending on the layout of the DataFrame, then it would be possible to rearrange the DataFrame to take advantage of them instead of using the user-provided raw DataFrame. Or, alternatively, require users to provide a DataFrame in the right structure.
If someone has, or generates, example data with many groups/strata, that would be helpful for profiling. I'd be more interested in real application data though. As an aside, grouped data in the survival analysis literature means discrete data for which the event happens during an interval and that's how it's recorded, so I think strata is the preferable keyword here.
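Until real application data turns up, synthetic data along these lines could serve for profiling. This is only a sketch under stated assumptions: exponential event times with a stratum-specific baseline hazard, independent exponential censoring, and made-up coefficients. The function name and all defaults are hypothetical.

```python
import numpy as np


def simulate_stratified_survival(n_strata=2000, per_stratum=5,
                                 n_covariates=3, seed=0):
    """Generate right-censored survival data with many small strata.

    Assumptions (illustrative only): proportional hazards with an
    exponential baseline per stratum and independent censoring.
    """
    rng = np.random.default_rng(seed)
    n = n_strata * per_stratum
    strata = np.repeat(np.arange(n_strata), per_stratum)
    exog = rng.standard_normal((n, n_covariates))
    beta = np.linspace(-0.5, 0.5, n_covariates)  # made-up coefficients
    # Each stratum gets its own constant baseline hazard.
    baseline = rng.uniform(0.5, 2.0, size=n_strata)[strata]
    hazard = baseline * np.exp(exog @ beta)
    event_time = rng.exponential(1.0 / hazard)
    censor_time = rng.exponential(2.0, size=n)
    time = np.minimum(event_time, censor_time)
    status = (event_time <= censor_time).astype(int)  # 1 = event observed
    return time, status, exog, strata


if __name__ == "__main__":
    time, status, exog, strata = simulate_stratified_survival()
    print(time.shape, exog.shape, np.unique(strata).size)
```

Varying `n_strata` and `per_stratum` gives both the "many small groups" and the "few large strata" cases discussed here.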
(Got sidetracked with other issues.) The timing for groups is more relevant in short/flat panel data than for a few strata. I checked the "docvisits" data that I was using for GMM, but it only has around 4000 observations in total. I don't know if we have a larger dataset (such as PSID or similar microeconometrics data).
…phreg Conflicts: statsmodels/sandbox/phreg.py statsmodels/sandbox/tests/survival_r_results.py statsmodels/sandbox/tests/test_phreg.py
Duplicate commits on June 30th, all except the first and the last two (without merge).
@kshedden I'm looking at the rebased version. Don't make any more changes in this branch/PR. I will merge the rebased branch #1825 after moving the files to
Regarding that error, I think there is just a
However, that shouldn't work: dictionary.values are not sorted in any deterministic way.
We don't need the collapsed gradients in any particular order since we are On Fri, Jul 11, 2014 at 10:38 PM, Josef Perktold notifications@github.com
Good, I didn't read the next lines to see what is done with it. I pushed the list fix and am waiting for TravisCI. In general: dot is faster than * and sum.
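A small sketch of the `dot` versus `*` and `sum` remark: both forms give the same inner product, but the elementwise version allocates a temporary array first. The array sizes and repeat count below are arbitrary choices for illustration, and actual timings depend on the machine and the BLAS build.

```python
from timeit import timeit

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000)

# Two ways to compute the same inner product.
via_dot = np.dot(a, b)
via_sum = (a * b).sum()  # builds the temporary array a * b first

if __name__ == "__main__":
    t_dot = timeit(lambda: np.dot(a, b), number=200)
    t_sum = timeit(lambda: (a * b).sum(), number=200)
    print(f"dot: {t_dot:.4f}s   * and sum: {t_sum:.4f}s")
    print("same result:", np.isclose(via_dot, via_sum))
```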
ENH: Cox Proportional Hazard Model, Phreg rebased closes statsmodels#1312
This is an implementation of Cox-type survival models. I put it in the sandbox along with the existing cox.py script. I'm not sure if there was a plan to develop cox.py into a mature implementation; the code I am submitting for review here is a fresh start. It is based on code I have been using for several years.
There are about 40 tests against R (coxph from the survival library), all of which pass.
This implementation handles left truncation (known in SAS as "entry times"), and stratification. Ties can be handled using either the Efron or Breslow method.
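For readers unfamiliar with the two tie-handling methods, here is a minimal sketch of the Cox partial log-likelihood with Breslow or Efron ties and optional entry times. It is a plain, unoptimized illustration written for this description, not the code in this PR; the function name and signature are hypothetical.

```python
import numpy as np


def cox_partial_loglike(params, time, status, exog, entry=None,
                        ties="breslow"):
    """Cox partial log-likelihood for a single stratum (illustrative).

    status is 1 for an observed event and 0 for censoring. If entry
    times are given, a subject is at risk at time t only when
    entry < t <= time (left truncation).
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=bool)
    exog = np.asarray(exog, dtype=float)
    if entry is None:
        entry = np.full(time.shape, -np.inf)
    else:
        entry = np.asarray(entry, dtype=float)

    linpred = exog @ np.asarray(params, dtype=float)
    risk = np.exp(linpred)

    loglike = 0.0
    for t in np.unique(time[status]):  # distinct event times
        at_risk = (entry < t) & (time >= t)
        failed = status & (time == t)
        d = int(failed.sum())  # number of tied events at t
        denom = risk[at_risk].sum()
        loglike += linpred[failed].sum()
        if ties == "breslow":
            # Every tied event uses the full risk-set sum.
            loglike -= d * np.log(denom)
        elif ties == "efron":
            # The k-th tied event removes a fraction k/d of the tied risk.
            k = np.arange(d)
            loglike -= np.log(denom - (k / d) * risk[failed].sum()).sum()
        else:
            raise ValueError("ties must be 'breslow' or 'efron'")
    return loglike
```

With stratification, the total log-likelihood is the sum of this quantity over strata, since each stratum has its own risk sets. When no event times are tied, the Breslow and Efron values coincide.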
Only the parameter estimates and standard errors are provided. Various other things like concordance indices and pseudo-R^2 measures could be added at some point.
This implementation does not allow time-varying covariates. I think that should be a separate implementation. I have some working code for time-varying covariates, but it needs some cleanup and I won't get to it right away.
Comments and suggestions are welcome.