Proportional hazards survival regression model (a.k.a. Cox model) #1312

kshedden · 2014-01-11T14:54:45Z

This is an implementation of Cox-type survival models. I put it in the sandbox along with the existing cox.py script. I'm not sure if there was a plan to develop cox.py into a mature implementation; the code I am submitting for review here is a fresh start. It is based on code I have been using for several years.

There are about 40 tests against R (coxph from the survival library), all of which pass.

This implementation handles left truncation (known in SAS as "entry times"), and stratification. Ties can be handled using either the Efron or Breslow method.

Only the parameter estimates and standard errors are provided. Various other things like concordance indices and pseudo-R^2 measures could be added at some point.

This implementation does not allow time-varying covariates. I think that should be a separate implementation. I have some working code for time-varying covariates, but it needs some cleanup and I won't get to it right away.

Comments and suggestions are welcome.

josef-pkt · 2014-01-11T15:58:01Z

Thanks Kerby,

Actually we had a student that worked for a while on survival models a few years ago.

As far as I can tell, the latest version is in Skipper's fork, it has Kaplan-Meier and some CoxPH
https://github.com/jseabold/statsmodels/compare/survival-rebased2#diff-2d2bf43dff2e46b0201e0ef4f155d8d4R697
I don't know what the status on that is @jseabold

and relate we have "Accelerated Failure Time (AFT) Model with empirical likelihood inference"
in statsmodels.emplike.aft_el

I need to see how we can integrate and merge these parts and yours so we have a common base to figure out the missing pieces.

coveralls · 2014-01-12T09:23:53Z

Coverage remained the same when pulling d831b94 on kshedden:phreg into fe6e688 on statsmodels:master.

jseabold · 2014-01-14T14:52:28Z

I'll have some comments on this based on my survival branch when I have a chance to come up for air.

jseabold · 2014-01-17T16:50:55Z

Do you think maybe you could dump all the results file into python classes? Kind of picky, but it'd be easier to look over the substantive changes. FYI, we have some tools for dumping from R to Python.

https://github.com/statsmodels/statsmodels/tree/master/tools/R2nparray

jseabold · 2014-01-17T17:02:13Z

FYI, the existing survival work is here.

https://github.com/jseabold/statsmodels/tree/survival

My plan is to look over yours and then try to rectify the two. Big bonus that you've dog-fooded this code for your own work.

kshedden · 2014-01-19T02:57:40Z

I rewrote the testing files using r2numpy to put the R outputs into python classes. It was a good suggestion, I think the test script is cleaner now.

jseabold · 2014-01-22T17:44:37Z

Ok, I've found some old code I hadn't committed yet and got mine running again. This code was salvaged from a broken back-up so it still needs a little work. Right now I have a mostly working KaplanMeier and CoxRegression with a few more results than you have available, but the parameters and like match. I pushed a branch with both implementations for comparison.

The biggest thing I was trying to work on was to make the loops over groups general, so we can re-use it elsewhere. What do you think? Is it worth trying to keep this general (and for internal-use only, so we can always refactor). Right now I'm not sure the added complexity is worth it. (Ie., I fear that I will likely remain the only one maintaining it similar to what's happened with the general tsa stuff.) Basically, this kind of stuff gives you the ability to get groups, iterate over groups, etc. in a general way vs. your hard-coded groups right now. Mainly just asking about the idea rather than the implementation. I'll probably refactor this.

https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/base/data.py#L392

Usage in the model

https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/sandbox/survival2.py#L778
https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/sandbox/survival2.py#L684

See the group_by_groups method there. It's also used in the plotting.

Here's an example usage. You can see what's available in res_ss, though it's hacked together right now and a lot of it doesn't work but only needs minor fixes. The main additions are plotting functions and baseline statistics from a Kaplan-Meier estimator.

https://github.com/jseabold/statsmodels/blob/phreg-survival-mixed/statsmodels/sandbox/examples/ex_cox_kshedden.py

res_ss.baseline               res_ss.params
res_ss.baseline_object        res_ss.plot
res_ss.bse                    res_ss.plot_baseline
res_ss.conf_int               res_ss.predict
res_ss.cov_params             res_ss.pvalues
res_ss.deviance_plot          res_ss.remove_data
res_ss.diagnostics            res_ss.save
res_ss.exog_mean              res_ss.scale
res_ss.f_test                 res_ss.scheonfeld_plot
res_ss.initialize             res_ss.score_test
res_ss.likelihood_ratio_test  res_ss.summary
res_ss.llf                    res_ss.t_test
res_ss.load                   res_ss.test_coefficients
res_ss.martingale_plot        res_ss.tvalues
res_ss.model                  res_ss.use_t
res_ss.normalized_cov_params  res_ss.wald_test

So here's some concrete questions:

Do we want to generalize the grouping? There is grouping code in my branch, but it needs to be compared to the Grouping mechanism in WIP: Add Panel Data models #1133 and see what we can share.
Naming conventions. groups vs. strata? I don't have a preference here.
Does the user pass in a Survival object as in my code, or is it hidden as in your code? My preference is slightly for the latter, though I'm open to discussion. We've discussed going this direction for panel data, etc.

Other than that, things are pretty much the same and just need to pick an implementation.

josef-pkt · 2014-01-22T18:28:01Z

I will also be maintaining this and go over it to understand it enough (except for maybe the pandas parts where I don't understand enough).

jseabold · 2014-01-22T18:30:42Z

So that's a vote for trying to generalize the grouping? You know better more of the use cases than I do right now. I'll see about cleaning up the implementation and comparing to what's already available in the panel one.

josef-pkt · 2014-01-22T18:35:58Z

Skipper can you open a PR so we get the TravisCI feedback.

automatically converting groups to GroupedData and converting to 2d, might cause problems in other code.
jseabold/statsmodels@master...phreg-survival-mixed#diff-2e18ce91dae3e3e78a23ba177fdfce58R408

jseabold · 2014-01-22T18:39:01Z

It's not ready for a PR. As I mentioned, I wouldn't look at the implementation much right now. It's a mess and has some fundamental problems that need re-thinking in model instantiation with groups.

Also, this isn't used anywhere else, so there won't be any problems elsewhere. Or do you mean eventually?

josef-pkt · 2014-01-22T18:43:34Z

So that's a vote for trying to generalize the grouping?

I thought about saying something but I don't know the code yet.
What I wanted to say is "Merge first, refactor later"

I'm all in favor of making the support, like group handling, more general so it can be reused across models, panel, GEE, cluster robust, ... However, I have an idea where we need, but I don't have enough overview to write a common "specification".
Also, Stratas here will only have a few groups, in panel, cluster microeconometrics we can have thousands of groups and I like to avoid group loops in that case.

josef-pkt · 2014-01-22T18:47:44Z

It's not ready for a PR.

make a WIP PR, you get TravisCI to check for you, and it will be easier to add comments for us

the groups keyword is already in use, maybe only GEE in the official parts IIRC, maybe a few more cases in the sandbox.

jseabold/statsmodels@master...phreg-survival-mixed#diff-2e18ce91dae3e3e78a23ba177fdfce58R532 would affect all models, if I interpret the changeset correctly

josef-pkt · 2014-01-23T14:18:20Z

I'll bet that the pandas grouphandling in GroupData is very inefficient, especially if the data array doesn't have a nice structure, and we need to recreate the groupspecific data arrays each time. That would be expensive in an optimization loop. (*)
I would prefer that we keep group handling out of base (for now) and subclasses call something like GroupedData directly.

(*) we should watch out for memory layout.
The advantage of list of arrays as in GEE (IIRC) is that it stores contiguous arrays that don't need to be rebuild or temporarily copied in functions like linear algebra routines (I guess). In panel hac I require that the layout is like in other packages, stacked time series for each individual, so we have fast access to contiguous slices (and views I guess).

If we index into the original dataframe, then the memory layout of it can be very different and indexing could have to recreate a new array each time.
That will not make a big difference in one time calculations, but it will become expensive in loops (like nonlinear optimization).
Also, optimizing the memory layout will not help if we don't have a predefined structure, as in the case when groups are categorical variables that could change, or when we have multiway groups without a fixed relative structure (e.g. panel with mostly time series effects or with mostly spatial/cross-section effects.) In the case of strata in survival and in repeated measures/longitudinal we have an unique or dominant structure that we should exploit.

kshedden · 2014-01-23T14:27:53Z

I like the grouping code and would be happy to update GEE to use it.

I am indifferent to which Cox implementation is used, but will be very happy to see one or the other merged since I use it all the time.

One implementation issue is the use of broadcasting in the Hessian calculation. If I follow your implementation correctly there is an extra layer of looping in the hessian calculation where mine uses broadcasting.

Regarding implementation for time dependent covariates, your approach of splitting into intervals where the covariates are constant is nice and straightforward. There is also an approach that directly calculates the partial likelihood in terms of covariate histories for other subjects in the risk set. I haven't thought through this enough, but I think there may be a way to optimize your approach by dropping intervals that don't contain events for other subjects.

josef-pkt · 2014-01-23T15:06:37Z

I like the grouping code and would be happy to update GEE to use it.

I would prefer to see some timings before doing that, especially for large number of groups (a few thousand).

to partially summarize my view on groups

I'd rather see Cox proportional hazard and panel merged without committing on a common, integrated specification for groups.
A general class for GroupedData as in base in Skipper's branch and as discussed and included in the panel branch will be useful for general use cases and as rapid development tool.
We need model specific GroupedData that can exploit specific data structures to be faster than the generic version. So models should be able to choose their GroupedData structure.
If we have a common interface to different Group classes then it will be easy to switch between them (as discussed in panel).
(maybe) It might be useful to separate the metadata (variable, observation, groupnames) as in base.data.GroupedData from the actual internal structure, so we can optimize the access to the raw data and still have access to the Data(Frame) information.

josef-pkt · 2014-01-23T15:17:14Z

Question where I don't know enough about pandas:

If pandas has/had fast paths for groupby depending on the layout of the dataframe, then it would be possible to rearrange the DataFrame to take advantage of it instead of using the user provided raw DataFrame. Or, alternatively require users to provide a DataFrame in the right structure.

jseabold · 2014-01-23T18:26:26Z

If someone has, or generates, example data with many groups/strata that would be helpful for profiling. I'd be more interested in real application data though.

Aside, grouped data in the survival analysis literature means discrete data for which the event happens during an interval at that's how it's recorded, so I think strata is the preferable keywod here.

josef-pkt · 2014-01-26T13:42:06Z

(got sidetracked with other issues)

The timing for groups is more relevant in short/flat panel data (than for a few stratas). I checked the "docvisits" data that I was using for GMM, but it only has around 4000 observations in total. I don't know if we have a larger dataset. (such as PSID or similar microeconometrics data)

…phreg Conflicts: statsmodels/sandbox/phreg.py statsmodels/sandbox/tests/survival_r_results.py statsmodels/sandbox/tests/test_phreg.py

josef-pkt · 2014-07-11T05:05:22Z

duplicate commits on june 30th, all except first, and last two (without merge)

josef-pkt · 2014-07-11T14:30:11Z

@kshedden
Do you know what the error in the test is https://travis-ci.org/statsmodels/statsmodels/jobs/29700017 ?
errors only on the py 3.3 version

I'm looking at the rebased version.
After adding __init__.py to tests/results directory, the csv files get installed and TravisCI runs the tests.

Don't make anymore changes in this branch/PR. I will merge the rebased branch #1825 after moving the files to duration and rebase again, (and looking at the test failure).

josef-pkt · 2014-07-11T14:36:14Z

To that error, I think there is just a list missing for dictionary.values. I'll try the fix and test.

josef-pkt · 2014-07-11T14:38:45Z

However that shouldn't work, dictionary.values are not sorted in any deterministic way.
grads = np.asarray(list(grads.values()))

kshedden · 2014-07-11T14:47:56Z

We don't need the collapsed gradients in any particular order since we are
just summing the squares.

On Fri, Jul 11, 2014 at 10:38 PM, Josef Perktold notifications@github.com
wrote:

However that shouldn't work, dictionary.values are not sorted in any
deterministic way.
grads = np.asarray(list(grads.values()))

—
Reply to this email directly or view it on GitHub
#1312 (comment)
.

josef-pkt · 2014-07-11T15:02:11Z

good, I didn't read the next lines to see what is done with it. I pushed the list fix, and waiting for TravisCI.

in general: dot it faster than * and sum.

ENH: Cox Proportional Hazard Model, Phreg rebased closes statsmodels#1312

jseabold mentioned this pull request Feb 7, 2014

WIP: Kaplan Meier, Survival models #548

Closed

josef-pkt added the PR label Feb 19, 2014

kshedden added 9 commits May 20, 2014 03:26

phreg initial commit of working code

d6cb44d

Added summary; other fixes; ready for PR

c1f1c1a

Removed unneeded import that was breaking on travis

a2e8de4

Renamed R output files to .txt

47acdad

Moved test script to tests directoy

a19a454

Remove deleted files

10372db

Moved test generator to tests directory

b29fed5

Remove copy of test script

ee9bc86

FIxed import path

e0a9b7e

kshedden added 18 commits June 30, 2014 01:30

Added score_obs; various cleanups.

a40e403

Robust covariances

c926826

Added offset; cleaned up some indexing code

6064d8d

Change behavior for selecting covariance type

23321e1

Added Schoenfeld residuals

943a559

Moved groups argument to fit

911005b

Remove uninformative cases

d9076c1

Use seed when generating data for tests

4a000a5

Minor changes

27aa08a

Minor changes

70f6faf

Added predict method

38f0434

Added get_distribution

b8123c8

Added get_distribution

86784c5

Added fit_regularized

1b0d7ac

Merge branch 'phreg' of https://github.com/kshedden/statsmodels into …

957147e

…phreg Conflicts: statsmodels/sandbox/phreg.py statsmodels/sandbox/tests/survival_r_results.py statsmodels/sandbox/tests/test_phreg.py

Improved algorithm for fit_regularized

b919a90

Added test vs. R for fit_regularized

d98a29f

Forgot two files in last commit

2b80826

josef-pkt mentioned this pull request Jul 10, 2014

release 0.6 #912

Closed

25 tasks

josef-pkt mentioned this pull request Jul 11, 2014

Phreg 1312 rebased #1825

Merged

josef-pkt closed this in 31e30a3 Jul 11, 2014

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014

Merge pull request statsmodels#1825 from josef-pkt/phreg_1312_rebased

9aea814

ENH: Cox Proportional Hazard Model, Phreg rebased closes statsmodels#1312

kshedden deleted the phreg branch September 27, 2014 14:22

bashtage added the rejected Used for PRs with changes that are not acceptable label Jul 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proportional hazards survival regression model (a.k.a. Cox model) #1312

Proportional hazards survival regression model (a.k.a. Cox model) #1312

kshedden commented Jan 11, 2014

josef-pkt commented Jan 11, 2014

coveralls commented Jan 12, 2014

jseabold commented Jan 14, 2014

jseabold commented Jan 17, 2014

jseabold commented Jan 17, 2014

kshedden commented Jan 19, 2014

jseabold commented Jan 22, 2014

josef-pkt commented Jan 22, 2014

jseabold commented Jan 22, 2014

josef-pkt commented Jan 22, 2014

jseabold commented Jan 22, 2014

josef-pkt commented Jan 22, 2014

josef-pkt commented Jan 22, 2014

josef-pkt commented Jan 23, 2014

kshedden commented Jan 23, 2014

josef-pkt commented Jan 23, 2014

josef-pkt commented Jan 23, 2014

jseabold commented Jan 23, 2014

josef-pkt commented Jan 26, 2014

josef-pkt commented Jul 11, 2014

josef-pkt commented Jul 11, 2014

josef-pkt commented Jul 11, 2014

josef-pkt commented Jul 11, 2014

kshedden commented Jul 11, 2014

josef-pkt commented Jul 11, 2014

Proportional hazards survival regression model (a.k.a. Cox model) #1312

Proportional hazards survival regression model (a.k.a. Cox model) #1312

Conversation

kshedden commented Jan 11, 2014

josef-pkt commented Jan 11, 2014

coveralls commented Jan 12, 2014

jseabold commented Jan 14, 2014

jseabold commented Jan 17, 2014

jseabold commented Jan 17, 2014

kshedden commented Jan 19, 2014

jseabold commented Jan 22, 2014

josef-pkt commented Jan 22, 2014

jseabold commented Jan 22, 2014

josef-pkt commented Jan 22, 2014

jseabold commented Jan 22, 2014

josef-pkt commented Jan 22, 2014

josef-pkt commented Jan 22, 2014

josef-pkt commented Jan 23, 2014

kshedden commented Jan 23, 2014

josef-pkt commented Jan 23, 2014

josef-pkt commented Jan 23, 2014

jseabold commented Jan 23, 2014

josef-pkt commented Jan 26, 2014

josef-pkt commented Jul 11, 2014

josef-pkt commented Jul 11, 2014

josef-pkt commented Jul 11, 2014

josef-pkt commented Jul 11, 2014

kshedden commented Jul 11, 2014

josef-pkt commented Jul 11, 2014