-
Notifications
You must be signed in to change notification settings - Fork 100
/
zph.Rnw
407 lines (368 loc) · 15.4 KB
/
zph.Rnw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
\section{The cox.zph function}
The simplest test of proportional hazards is to use a time dependent
coefficient $\beta(t) = a + bt$.
Then $\beta(t) x = ax + b*(tx)$, and the extended coefficients $a$ and $b$
can be obtained from a Cox model with an extra 'fake' covariate $tx$.
More generally, replace $t$ with some function $g(t)$, which gives rise to
an entire family of tests.
An efficient assessment of this extended model can be done using a score
test.
\begin{itemize}
\item Augment the original variables $x_1, \ldots x_k$ with $k$ new ones
$g(t)x_1, \ldots, g(t)x_k$
\item Compute the first and second derivatives $U$ and $H$ of the Cox model
at the starting estimate of $(\hat\beta, 0)$; prior covariates at their
prior values, and the new covariates at 0. No iteration is done.
This can be done efficiently with a modified version of the primary C routines
for coxph.
\item By design, the first $k$ elements of $U$ will be zero. Thus the
first iteration of the new coefficients, and the score tests for them, are
particularly easy.
\end{itemize}
The information or Hessian matrix for a Cox model is
$$ \sum_{j \in deaths} V(t_j) = \sum_jV_j$$
where $V_j$ is the variance matrix of the weighted covariate values, over
all subjects at risk at time $t_j$.
Then the expanded information matrix for the score test is
\begin{align*}
H &= \left(\begin{array}{cc} H_1 & H_2 \\ H_2' & H_3 \end{array} \right) \\
H_1 &= \sum V(t_j) \\
H_2 &= \sum V(t_j) g(t_j) \\
H_3 &= \sum V(t_j) g^2(t_j)
\end{align*}
The inverse of the matrix will be more numerically stable if $g(t)$ is centered
at zero, and this does not change the test statistic.
In the usual case $V(t)$ is close to constant in time --- the variance of
$X$ does not change rapidly --- and then $H_2$ is approximately zero.
The original cox.zph used an approximation, which is to assume that
$V(t)$ is exactly constant.
In that case $H_2=0$ and $H_3= \sum V(t_j) \sum g^2(t_j)$ and the test
is particularly easy to compute.
This assumption of identical components can fail badly for models with a
covariate by strata interaction, and for some models with covariate
dependent censoring.
Multi-state models finally forced a change.
The newer version of the routine has two separate tracks: for the formal test
and another for the residuals.
<<cox.zph>>=
cox.zph <- function(fit, transform='km', terms=TRUE, singledf =FALSE,
global=TRUE) {
Call <- match.call()
if (!inherits(fit, "coxph") && !inherits(fit, "coxme"))
stop ("argument must be the result of Cox model fit")
if (inherits(fit, "coxph.null"))
stop("there are no score residuals for a Null model")
if (!is.null(attr(terms(fit), "specials")[["tt"]]))
stop("function not defined for models with tt() terms")
if (inherits(fit, "coxme")) {
# drop all mention of the random effects, before getdata
fit$formula <- fit$formula$fixed
fit$call$formula <- fit$formula
}
cget <- coxph.getdata(fit, y=TRUE, x=TRUE, stratax=TRUE, weights=TRUE)
y <- cget$y
ny <- ncol(y)
event <- (y[,ny] ==1)
if (length(cget$strata))
istrat <- as.integer(cget$strata) - 1L # number from 0 for C
else istrat <- rep(0L, nrow(y))
# if terms==FALSE the singledf argument is moot, but setting a value
# leads to a simpler path through the code
if (!terms) singledf <- FALSE
<<zph-setup>>
<<zph-transform>>
<<zph-terms>>
<<zph-schoen>>
rval$transform <- tname
rval$call <- Call
class(rval) <- "cox.zph"
return(rval)
}
print.cox.zph <- function(x, digits = max(options()$digits - 4, 3),
signif.stars=FALSE, ...) {
invisible(printCoefmat(x$table, digits=digits, signif.stars=signif.stars,
P.values=TRUE, has.Pvalue=TRUE, ...))
}
@
The user can use $t$ or $g(t)$ as the multiplier of the covariates.
The default is to use the KM, only because that seems to be best at
avoiding edge cases.
<<zph-transform>>=
times <- y[,ny-1]
if (is.character(transform)) {
tname <- transform
ttimes <- switch(transform,
'identity'= times,
'rank' = rank(times),
'log' = log(times),
'km' = {
temp <- survfitKM(factor(rep(1L, nrow(y))),
y, se.fit=FALSE)
# A nuisance to do left continuous KM
indx <- findInterval(times, temp$time, left.open=TRUE)
1.0 - c(1, temp$surv)[indx+1]
},
stop("Unrecognized transform"))
}
else {
tname <- deparse(substitute(transform))
if (length(tname) >1) tname <- 'user'
ttimes <- transform(times)
}
gtime <- ttimes - mean(ttimes[event])
# Now get the U, information, and residuals
if (ny==2) {
ord <- order(istrat, y[,1]) -1L
resid <- .Call(Czph1, gtime, y, X, eta,
cget$weights, istrat, fit$method=="efron", ord)
}
else {
ord1 <- order(-istrat, -y[,1]) -1L # reverse time for zph2
ord <- order(-istrat, -y[,2]) -1L
resid <- .Call(Czph2, gtime, y, X, eta,
cget$weights, istrat, fit$method=="efron",
ord1, ord)
}
@
The result has a score vector of length $2p$ where $p$ is the number of
variables and an information matrix that is $2p$ by $2p$.
This is done with C code that
is a simple variation on iteration 1 for a coxph model.
If \code{singledf} is TRUE then treat each term as a single degree of
freedom test, otherwise as a multi-degree of freedom.
If terms=FALSE test each covariate individually.
If all the variables are univariate this is a moot point.
The survival routines return Splus style assign components, that is a list
with one element per term, each element an integer vector of coefficient
indices.
The asgn vector is our main workhorse: loop over asgn to process term by
term.
\begin{itemize}
\item if term=FALSE, set make a new asgn with one coef per term
\item if a coefficient is NA, remove it from the relevant asgn vector
\item frailties and penalized coxme coefficients are ignored: remove
their element from the asgn list
\end{itemize}
For random effects models, including both frailty and coxme results, the
random effect is included in the linear.predictors component of the
fit. This allows us to do score tests for the other terms while effectively
holding the random effect fixed.
If there are any NA coefficients these are redundant variables. It's
easiest to simply get rid of them at the start by fixing up X, varnames,
asgn, and nvar.
<<zph-setup>>=
eta <- fit$linear.predictors
X <- cget$x
varnames <- names(fit$coefficients)
nvar <- length(varnames)
if (!terms) {
# create a fake asgn that has one value per coefficient
asgn <- as.list(1:nvar)
names(asgn) <- names(fit$coefficients)
}
else if (inherits(fit, "coxme")) {
asgn <- attrassign(cget$x, terms(fit))
# allow for a spelling inconsistency in coxme, later fixed
if (is.null(fit$linear.predictors))
eta <- fit$linear.predictor
fit$df <- NULL # don't confuse later code
}
else asgn <- fit$assign
if (!is.list(asgn)) stop ("unexpected assign component")
frail <- grepl("frailty(", names(asgn), fixed=TRUE)
if (any(frail)) {
dcol <- unlist(asgn[frail]) # remove these columns from X
X <- X[, -dcol, drop=FALSE]
asgn <- asgn[!frail]
# frailties don't appear in the varnames, so no change there
}
nterm <- length(asgn)
termname <- names(asgn)
if (any(is.na(fit$coefficients))) {
keep <- !is.na(fit$coefficients)
varnames <- varnames[keep]
X <- X[,keep]
# fix up assign
new <- unname(unlist(asgn))[keep] # the ones to keep
asgn <- sapply(asgn, function(x) {
i <- match(x, new, nomatch=0)
i[i>0]})
asgn <- asgn[sapply(asgn, length)>0] # drop any that were lost
termname <- names(asgn)
nterm <- length(asgn) # asgn will be a list
nvar <- length(new)
}
@
The zph1 and zph2 functions do not consider penalties, so we need to add
those back in after the call.
Nothing needs to be done wrt the first derivative: we already ignore the
first ncoef elements of the returned first derivative (u) vector, which would
have had a penalty. The second portion of u is for beta=0, and all of the
penalties that currently are implemented have first derivative 0 at 0.
For the second derivative, the current penalties (frailty, rigde, pspline) have
a second derivative penalty that is independent of beta-hat.
The coxph result contains the numeric value of the penalty at the solution,
and we use a score test that would penalize the new time*pspline() term in
the same way as the pspline term was penalized.
If no coefficients were missing then allvar will be 1:n, otherwise it
will have holes.
<<zph-terms>>=
test <- double(nterm+1)
df <- rep(1L, nterm+1)
u0 <- rep(0, nvar)
if (!is.null(fit$coxlist2)) { # there are penalized terms
pmat <- matrix(0., 2*nvar, 2*nvar) # second derivative penalty
pmat[1:nvar, 1:nvar] <- fit$coxlist2$second
pmat[1:nvar + nvar, 1:nvar + nvar] <- fit$coxlist2$second
imatr <- resid$imat + pmat
}
else imatr <- resid$imat
for (ii in 1:nterm) {
jj <- asgn[[ii]]
kk <- c(1:nvar, jj+nvar)
imat <- imatr[kk, kk]
u <- c(u0, resid$u[jj+nvar])
if (singledf && length(jj) >1) {
vv <- solve(imat)[-(1:nvar), -(1:nvar)]
t1 <- sum(fit$coef[jj] * resid$u[jj+nvar])
test[ii] <- t1^2 * (fit$coef[jj] %*% vv %*% fit$coef[jj])
df[ii] <- 1
}
else {
test[ii] <- drop(solve(imat,u) %*% u)
if (is.null(fit$df)) df[ii] <- length(jj)
else df[ii] <- fit$df[ii]
}
}
#Global test
if (global) {
u <- c(u0, resid$u[-(1:nvar)])
test[nterm+1] <- solve(imatr, u) %*% u
if (is.null(fit$df)) df[nterm+1] <- nvar
else df[nterm+1] <- sum(fit$df)
tbl <- cbind(test, df, pchisq(test, df, lower.tail=FALSE))
dimnames(tbl) <- list(c(termname, "GLOBAL"), c("chisq", "df", "p"))
}
else {
tbl <- cbind(test, df, pchisq(test, df, lower.tail=FALSE))[1:nterm,, drop=FALSE]
dimnames(tbl) <- list(termname, c("chisq", "df", "p"))
}
# The x, y, residuals part is sorted by time within strata; this is
# what the C routine zph1 and zph2 return
indx <- if (ny==2) ord +1 else rev(ord) +1 # return to 1 based subscripts
indx <- indx[event[indx]] # only keep the death times
rval <- list(table=tbl, x=unname(ttimes[indx]), time=unname(y[indx, ny-1]))
if (length(cget$strata)) rval$strata <- cget$strata[indx]
@
The matrix of scaled Schoenfeld residuals is created one stratum at a
time.
The ideal for the residual $r(t_i)$, contributed by an event for subject
$i$ at time $t_i$ is to use $r_iV^{-1}(t_i)$, the inverse of the variance
matrix of $X$ at that time and for the relevant stratum.
What is returned as \code{resid\$imat} is $\sum_i V(t_i)$.
One option would have been to return all the individual $\hat V_i$ matrices,
but that falls over when the number at risk is too small and it cannot
be inverted.
Option 2 would be to use a per stratum averge of the $V_i$, but that falls
flat for models with a large number of strata, a nested case-control model
for instance.
We take a different average that may not be the best, but seems to be
good enough and doesn't seem to fail.
\begin{enumerate}
\item The \code{resid\$used} matrix contains the number of deaths for
each strata (row) that contributed to the sum for each variable (column).
The value is either 0 or the number of events in the stratum, zero for those
variables that are constant within the stratum. From this we can get the
number of events that contributed to each element of the \code{imat} total.
Dividing by this gives a per-element average \code{vmean}.
\item For a given stratum, some of the covariates may have been unused. For
any of those set the scaled Schoenfeld residual to NA, and use the other
rows/columns of the \code{vmean} matrix to scale the rest.
\end{enumerate}
Now if some variable $x_1$ has a large variance at some time points and a
small variance at others, or a large variance in one stratum and a small
variance in another, the above smoothing won't catch that subtlety.
However we expect such an issue to be rare.
The common problem of strata*covariate interactions is the target of the
above manipulations.
<<zph-schoen>>=
# Watch out for a particular edge case: there is a factor, and one of the
# strata happens to not use one of its levels. The element of resid$used will
# be zero, but it really should not.
used <-resid$used
for (i in asgn) {
if (length(i) > 1 && any(used[,i] ==0))
used[,i] <- apply(used[,i,drop=FALSE], 1, max)
}
# Make the weight matrix
wtmat <- matrix(0, nvar, nvar)
for (i in 1:nrow(used))
wtmat <- wtmat + outer(used[i,], used[i,], pmin)
# with strata*covariate interactions (multi-state models for instance) the
# imatr matrix will be block diagonal. Don't divide these off diagonal zeros
# by a wtmat value of zero.
vmean <- imatr[1:nvar, 1:nvar, drop=FALSE]/ifelse(wtmat==0, 1, wtmat)
sresid <- resid$schoen
if (terms && any(sapply(asgn, length) > 1)) { # collase multi-column terms
temp <- matrix(0, ncol(sresid), nterm)
for (i in 1:nterm) {
j <- asgn[[i]]
if (length(j) ==1) temp[j, i] <- 1
else temp[j, i] <- fit$coefficients[j]
}
sresid <- sresid %*% temp
vmean <- t(temp) %*% vmean %*% temp
used <- used[, sapply(asgn, function(x) x[1]), drop=FALSE]
}
dimnames(sresid) <- list(signif(rval$time, 4), termname)
# for each stratum, rescale the Schoenfeld residuals in that stratum
sgrp <- rep(1:nrow(used), apply(used, 1, max))
for (i in 1:nrow(used)) {
k <- which(used[i,] > 0)
if (length(k) >0) { # there might be no deaths in the stratum
j <- which(sgrp==i)
if (length(k) ==1) sresid[j,k] <- sresid[j,k]/vmean[k,k]
else sresid[j, k] <- t(solve(vmean[k, k], t(sresid[j, k, drop=FALSE])))
sresid[j, -k] <- NA
}
}
# Add in beta-hat. For a term with multiple columns we are testing zph for
# the linear predictor X\beta, which always has a coefficient of 1
for (i in 1:nterm) {
j <- asgn[[i]]
if (length(j) ==1) sresid[,i] <- sresid[,i] + fit$coefficients[j]
else sresid[,i] <- sresid[,i] +1
}
rval$y <- sresid
rval$var <- solve(vmean)
@
<<cox.zph>>=
"[.cox.zph" <- function(x, ..., drop=FALSE) {
i <- ..1
if (!is.null(x$strata)) {
y2 <- x$y[,i,drop=FALSE]
ymiss <- apply(is.na(y2), 1, all)
if (any(ymiss)) {
# some deaths played no role in these coefficients
# due to a strata * covariate interaction, drop unneeded rows
z<- list(table=x$table[i,,drop=FALSE], x=x$x[!ymiss],
time= x$time[!ymiss],
strata = x$strata[!ymiss],
y = y2[!ymiss,,drop=FALSE],
var=x$var[i,i, drop=FALSE],
transform=x$transform, call=x$call)
}
else z<- list(table=x$table[i,,drop=FALSE], x=x$x, time= x$time,
strata = x$strata,
y = y2, var=x$var[i,i, drop=FALSE],
transform=x$transform, call=x$call)
}
else
z<- list(table=x$table[i,,drop=FALSE], x=x$x, time= x$time,
y = x$y[,i,drop=FALSE],
var=x$var[i,i, drop=FALSE],
transform=x$transform, call=x$call)
class(z) <- class(x)
z
}
@