Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proportional hazards survival regression model (a.k.a. Cox model) #1312

Closed
wants to merge 88 commits into from

Conversation

kshedden
Copy link
Contributor

This is an implementation of Cox-type survival models. I put it in the sandbox along with the existing cox.py script. I'm not sure if there was a plan to develop cox.py into a mature implementation; the code I am submitting for review here is a fresh start. It is based on code I have been using for several years.

There are about 40 tests against R (coxph from the survival library), all of which pass.

This implementation handles left truncation (known in SAS as "entry times"), and stratification. Ties can be handled using either the Efron or Breslow method.

Only the parameter estimates and standard errors are provided. Various other things like concordance indices and pseudo-R^2 measures could be added at some point.

This implementation does not allow time-varying covariates. I think that should be a separate implementation. I have some working code for time-varying covariates, but it needs some cleanup and I won't get to it right away.

Comments and suggestions are welcome.

@josef-pkt
Copy link
Member

Thanks Kerby,

Actually we had a student that worked for a while on survival models a few years ago.

As far as I can tell, the latest version is in Skipper's fork, it has Kaplan-Meier and some CoxPH
https://github.com/jseabold/statsmodels/compare/survival-rebased2#diff-2d2bf43dff2e46b0201e0ef4f155d8d4R697
I don't know what the status on that is @jseabold

and relate we have "Accelerated Failure Time (AFT) Model with empirical likelihood inference"
in statsmodels.emplike.aft_el

I need to see how we can integrate and merge these parts and yours so we have a common base to figure out the missing pieces.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling d831b94 on kshedden:phreg into fe6e688 on statsmodels:master.

@jseabold
Copy link
Member

I'll have some comments on this based on my survival branch when I have a chance to come up for air.

@jseabold
Copy link
Member

Do you think maybe you could dump all the results file into python classes? Kind of picky, but it'd be easier to look over the substantive changes. FYI, we have some tools for dumping from R to Python.

https://github.com/statsmodels/statsmodels/tree/master/tools/R2nparray

@jseabold
Copy link
Member

FYI, the existing survival work is here.

https://github.com/jseabold/statsmodels/tree/survival

My plan is to look over yours and then try to rectify the two. Big bonus that you've dog-fooded this code for your own work.

@kshedden
Copy link
Contributor Author

I rewrote the testing files using r2numpy to put the R outputs into python classes. It was a good suggestion, I think the test script is cleaner now.

@jseabold
Copy link
Member

Ok, I've found some old code I hadn't committed yet and got mine running again. This code was salvaged from a broken back-up so it still needs a little work. Right now I have a mostly working KaplanMeier and CoxRegression with a few more results than you have available, but the parameters and like match. I pushed a branch with both implementations for comparison.

The biggest thing I was trying to work on was to make the loops over groups general, so we can re-use it elsewhere. What do you think? Is it worth trying to keep this general (and for internal-use only, so we can always refactor). Right now I'm not sure the added complexity is worth it. (Ie., I fear that I will likely remain the only one maintaining it similar to what's happened with the general tsa stuff.) Basically, this kind of stuff gives you the ability to get groups, iterate over groups, etc. in a general way vs. your hard-coded groups right now. Mainly just asking about the idea rather than the implementation. I'll probably refactor this.

https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/base/data.py#L392

Usage in the model

https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/sandbox/survival2.py#L778
https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/sandbox/survival2.py#L684

See the group_by_groups method there. It's also used in the plotting.

Here's an example usage. You can see what's available in res_ss, though it's hacked together right now and a lot of it doesn't work but only needs minor fixes. The main additions are plotting functions and baseline statistics from a Kaplan-Meier estimator.

https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/sandbox/examples/ex_cox_kshedden.py

res_ss.baseline               res_ss.params
res_ss.baseline_object        res_ss.plot
res_ss.bse                    res_ss.plot_baseline
res_ss.conf_int               res_ss.predict
res_ss.cov_params             res_ss.pvalues
res_ss.deviance_plot          res_ss.remove_data
res_ss.diagnostics            res_ss.save
res_ss.exog_mean              res_ss.scale
res_ss.f_test                 res_ss.scheonfeld_plot
res_ss.initialize             res_ss.score_test
res_ss.likelihood_ratio_test  res_ss.summary
res_ss.llf                    res_ss.t_test
res_ss.load                   res_ss.test_coefficients
res_ss.martingale_plot        res_ss.tvalues
res_ss.model                  res_ss.use_t
res_ss.normalized_cov_params  res_ss.wald_test

So here's some concrete questions:

  • Do we want to generalize the grouping? There is grouping code in my branch, but it needs to be compared to the Grouping mechanism in WIP: Add Panel Data models #1133 and see what we can share.
  • Naming conventions. groups vs. strata? I don't have a preference here.
  • Does the user pass in a Survival object as in my code, or is it hidden as in your code? My preference is slightly for the latter, though I'm open to discussion. We've discussed going this direction for panel data, etc.

Other than that, things are pretty much the same and just need to pick an implementation.

@josef-pkt
Copy link
Member

I will also be maintaining this and go over it to understand it enough (except for maybe the pandas parts where I don't understand enough).

@jseabold
Copy link
Member

So that's a vote for trying to generalize the grouping? You know better more of the use cases than I do right now. I'll see about cleaning up the implementation and comparing to what's already available in the panel one.

@josef-pkt
Copy link
Member

Skipper can you open a PR so we get the TravisCI feedback.

automatically converting groups to GroupedData and converting to 2d, might cause problems in other code.
jseabold/statsmodels@master...phreg-survival-mixed#diff-2e18ce91dae3e3e78a23ba177fdfce58R408

@jseabold
Copy link
Member

It's not ready for a PR. As I mentioned, I wouldn't look at the implementation much right now. It's a mess and has some fundamental problems that need re-thinking in model instantiation with groups.

Also, this isn't used anywhere else, so there won't be any problems elsewhere. Or do you mean eventually?

@josef-pkt
Copy link
Member

So that's a vote for trying to generalize the grouping?

I thought about saying something but I don't know the code yet.
What I wanted to say is "Merge first, refactor later"

I'm all in favor of making the support, like group handling, more general so it can be reused across models, panel, GEE, cluster robust, ... However, I have an idea where we need, but I don't have enough overview to write a common "specification".
Also, Stratas here will only have a few groups, in panel, cluster microeconometrics we can have thousands of groups and I like to avoid group loops in that case.

@josef-pkt
Copy link
Member

It's not ready for a PR.

make a WIP PR, you get TravisCI to check for you, and it will be easier to add comments for us

the groups keyword is already in use, maybe only GEE in the official parts IIRC, maybe a few more cases in the sandbox.

jseabold/statsmodels@master...phreg-survival-mixed#diff-2e18ce91dae3e3e78a23ba177fdfce58R532 would affect all models, if I interpret the changeset correctly

@josef-pkt
Copy link
Member

I'll bet that the pandas grouphandling in GroupData is very inefficient, especially if the data array doesn't have a nice structure, and we need to recreate the groupspecific data arrays each time. That would be expensive in an optimization loop. (*)
I would prefer that we keep group handling out of base (for now) and subclasses call something like GroupedData directly.

(*) we should watch out for memory layout.
The advantage of list of arrays as in GEE (IIRC) is that it stores contiguous arrays that don't need to be rebuild or temporarily copied in functions like linear algebra routines (I guess). In panel hac I require that the layout is like in other packages, stacked time series for each individual, so we have fast access to contiguous slices (and views I guess).

If we index into the original dataframe, then the memory layout of it can be very different and indexing could have to recreate a new array each time.
That will not make a big difference in one time calculations, but it will become expensive in loops (like nonlinear optimization).
Also, optimizing the memory layout will not help if we don't have a predefined structure, as in the case when groups are categorical variables that could change, or when we have multiway groups without a fixed relative structure (e.g. panel with mostly time series effects or with mostly spatial/cross-section effects.) In the case of strata in survival and in repeated measures/longitudinal we have an unique or dominant structure that we should exploit.

@kshedden
Copy link
Contributor Author

I like the grouping code and would be happy to update GEE to use it.

I am indifferent to which Cox implementation is used, but will be very happy to see one or the other merged since I use it all the time.

One implementation issue is the use of broadcasting in the Hessian calculation. If I follow your implementation correctly there is an extra layer of looping in the hessian calculation where mine uses broadcasting.

Regarding implementation for time dependent covariates, your approach of splitting into intervals where the covariates are constant is nice and straightforward. There is also an approach that directly calculates the partial likelihood in terms of covariate histories for other subjects in the risk set. I haven't thought through this enough, but I think there may be a way to optimize your approach by dropping intervals that don't contain events for other subjects.

@josef-pkt
Copy link
Member

I like the grouping code and would be happy to update GEE to use it.

I would prefer to see some timings before doing that, especially for large number of groups (a few thousand).

to partially summarize my view on groups

  • I'd rather see Cox proportional hazard and panel merged without committing on a common, integrated specification for groups.
  • A general class for GroupedData as in base in Skipper's branch and as discussed and included in the panel branch will be useful for general use cases and as rapid development tool.
  • We need model specific GroupedData that can exploit specific data structures to be faster than the generic version. So models should be able to choose their GroupedData structure.
  • If we have a common interface to different Group classes then it will be easy to switch between them (as discussed in panel).
  • (maybe) It might be useful to separate the metadata (variable, observation, groupnames) as in base.data.GroupedData from the actual internal structure, so we can optimize the access to the raw data and still have access to the Data(Frame) information.

@josef-pkt
Copy link
Member

Question where I don't know enough about pandas:

If pandas has/had fast paths for groupby depending on the layout of the dataframe, then it would be possible to rearrange the DataFrame to take advantage of it instead of using the user provided raw DataFrame. Or, alternatively require users to provide a DataFrame in the right structure.

@jseabold
Copy link
Member

If someone has, or generates, example data with many groups/strata that would be helpful for profiling. I'd be more interested in real application data though.

Aside, grouped data in the survival analysis literature means discrete data for which the event happens during an interval at that's how it's recorded, so I think strata is the preferable keywod here.

@josef-pkt
Copy link
Member

(got sidetracked with other issues)

The timing for groups is more relevant in short/flat panel data (than for a few stratas). I checked the "docvisits" data that I was using for GMM, but it only has around 4000 observations in total. I don't know if we have a larger dataset. (such as PSID or similar microeconometrics data)

@josef-pkt josef-pkt mentioned this pull request Jul 10, 2014
25 tasks
@josef-pkt
Copy link
Member

duplicate commits on june 30th, all except first, and last two (without merge)

@josef-pkt josef-pkt mentioned this pull request Jul 11, 2014
@josef-pkt
Copy link
Member

@kshedden
Do you know what the error in the test is https://travis-ci.org/statsmodels/statsmodels/jobs/29700017 ?
errors only on the py 3.3 version

I'm looking at the rebased version.
After adding __init__.py to tests/results directory, the csv files get installed and TravisCI runs the tests.

Don't make anymore changes in this branch/PR. I will merge the rebased branch #1825 after moving the files to duration and rebase again, (and looking at the test failure).

@josef-pkt
Copy link
Member

To that error, I think there is just a list missing for dictionary.values. I'll try the fix and test.

@josef-pkt
Copy link
Member

However that shouldn't work, dictionary.values are not sorted in any deterministic way.
grads = np.asarray(list(grads.values()))

@kshedden
Copy link
Contributor Author

We don't need the collapsed gradients in any particular order since we are
just summing the squares.

On Fri, Jul 11, 2014 at 10:38 PM, Josef Perktold notifications@github.com
wrote:

However that shouldn't work, dictionary.values are not sorted in any
deterministic way.
grads = np.asarray(list(grads.values()))


Reply to this email directly or view it on GitHub
#1312 (comment)
.

@josef-pkt
Copy link
Member

good, I didn't read the next lines to see what is done with it. I pushed the list fix, and waiting for TravisCI.

in general: dot it faster than * and sum.

@josef-pkt josef-pkt closed this in 31e30a3 Jul 11, 2014
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014
ENH: Cox Proportional Hazard Model, Phreg rebased closes statsmodels#1312
@kshedden kshedden deleted the phreg branch September 27, 2014 14:22
@bashtage bashtage added the rejected Used for PRs with changes that are not acceptable label Jul 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rejected Used for PRs with changes that are not acceptable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants