# Calibration



Calibration plot: $\hat{p}$ on the horizontal axis, $p$ on the vertical axis, estimated by kernel regression or by binning.

$$
\widehat{\mathbb{E}}[p] = \dfrac{1}{N} \sum_{i=1}^N d_i
$$


Hosmer–Lemeshow goodness-of-fit test



Lorenz curve/Gini coefficient: For a level $\hat{p}$ of predicted risk, there should be $p$ proportion of realized risk in the population


To condition on $\hat{p}$, we can use a kernel estimator,
$$
\widehat{\mathbb{E}}[p|\hat{p}] =  \dfrac{\dfrac{1}{N} \sum_{i=1}^N d_i \times k_h(\hat{p}-\hat{p}_i)}{\dfrac{1}{N} \sum_{i=1}^N k_h(\hat{p}-\hat{p}_i)} = \gamma(\hat{p})
$$
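The kernel estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian kernel, the bandwidth `h=0.1`, the toy data, and the function names `gaussian_kernel` and `kernel_calibration` are all hypothetical choices, not something the notes prescribe.

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel k_h(u) with bandwidth h (an assumed kernel choice)."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def kernel_calibration(p_grid, p_hat, d, h=0.1):
    """Kernel-weighted average of outcomes d at each evaluation point in p_grid."""
    curve = []
    for p in p_grid:
        weights = [gaussian_kernel(p - ph, h) for ph in p_hat]
        total = sum(weights)
        curve.append(sum(w * di for w, di in zip(weights, d)) / total)
    return curve

# Toy data (hypothetical): predicted risks and observed 0/1 outcomes.
p_hat = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
d = [0, 0, 1, 0, 1, 1]
gamma = kernel_calibration([0.25, 0.5, 0.75], p_hat, d, h=0.1)
print(gamma)  # estimated observed risk at predicted risks 0.25, 0.5, 0.75
```

Plotting `gamma` against the grid gives the smoothed calibration curve; a perfectly calibrated model would lie on the 45-degree line. The bandwidth controls the smoothness, and in practice it would need to be tuned.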


For linear regression:
$$
y' = a + b \times \hat{y}
$$


For logistic regression:
$$
\text{logit}( \hat{p}' ) = a + b \times x_i \hat{\beta}
$$


The coefficient $b$ is the **calibration slope**. If $\hat{b}<1$, predictions are too extreme (a sign of overfitting), and shrinkage of the regression coefficients improves average performance on a validation set.
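As a rough sketch of how the logistic recalibration model might be fitted, the snippet below estimates $(a, b)$ by Newton-Raphson using only the standard library. The function name `fit_recalibration`, the simulated data, and the deliberately "too extreme" linear predictor (twice the true one) are assumptions made for illustration; a statistics package would normally be used instead.

```python
import math
import random

def fit_recalibration(lp, d, n_iter=25):
    """Fit logit(P(d=1)) = a + b * lp by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(lp, d):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: (a, b) += H^{-1} g, written out for the 2x2 case.
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
    return a, b

# Simulated validation data (hypothetical): outcomes follow true_lp,
# but the model's linear predictor is twice as extreme.
random.seed(0)
true_lp = [random.gauss(0.0, 1.0) for _ in range(2000)]
d = [1 if random.random() < 1.0 / (1.0 + math.exp(-t)) else 0 for t in true_lp]
lp = [2.0 * t for t in true_lp]

a_hat, b_hat = fit_recalibration(lp, d)
print(a_hat, b_hat)  # slope should land near 0.5, intercept near 0
```

In this setup the estimated slope is well below 1, which is the pattern the text associates with a need for shrinkage.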

## Assessing Calibration Slope

Continuous:
- Calibration-in-the-large: Compute $y_{new} - \hat{y}$, $t$-test null of zero difference; i.e. regress residuals on a constant.
- Calibration slope: Regress $y_{new} = a + b \times \hat{y}$. Seeking $a=0$ and $b=1$.
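The two checks for continuous outcomes can be sketched as follows, regressing the new outcomes on the predictions as in the linear recalibration model $y' = a + b \times \hat{y}$. The helper names and the toy numbers are hypothetical; the example is constructed so that the mean residual is zero while the slope is clearly off.

```python
import math

def mean(values):
    return sum(values) / len(values)

def calibration_in_the_large(y_new, y_hat):
    """Mean residual and its one-sample t statistic (null: mean residual is zero)."""
    resid = [y - f for y, f in zip(y_new, y_hat)]
    n = len(resid)
    m = mean(resid)
    s2 = sum((r - m) ** 2 for r in resid) / (n - 1)
    return m, m / math.sqrt(s2 / n)

def calibration_line(y_new, y_hat):
    """OLS fit of y_new = a + b * y_hat; returns (a, b)."""
    mx, my = mean(y_hat), mean(y_new)
    sxx = sum((x - mx) ** 2 for x in y_hat)
    sxy = sum((x - mx) * (y - my) for x, y in zip(y_hat, y_new))
    b = sxy / sxx
    return my - b * mx, b

# Toy data (hypothetical): predictions are right on average but too extreme.
y_hat = [1.0, 2.0, 3.0, 4.0, 5.0]
y_new = [2.0, 2.5, 3.0, 3.5, 4.0]

mean_resid, t_stat = calibration_in_the_large(y_new, y_hat)
a, b = calibration_line(y_new, y_hat)
print(mean_resid, t_stat)  # both zero: no miscalibration in the large
print(a, b)                # a = 1.5, b = 0.5: slope well below 1
```

The example shows why both checks are useful: the $t$-test on residuals would not flag this model, while the slope does.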

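For binary outcomes, calibration-in-the-large can be summarized as an odds ratio of the mean prediction to the mean outcome (reading $m(\cdot)$ as the sample mean), with $OR = 1$ indicating agreement:
$$
OR = \dfrac{m(\hat{y})}{1- m(\hat{y})} \times \left[ \dfrac{m(y_{new})}{1-m(y_{new})} \right]^{-1}
$$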

$c$ statistic: concordance statistic

## Strong Calibration
- Essentially goodness-of-fit
- Lack of fit: Missed nonlinearities/interactions, inappropriate link function


Another approach to goodness-of-fit is to study observed versus expected outcomes
in subgroups of patients, defined by predictor values. For example, we can
assess the difference between observed versus expected outcomes in males and
females, or other subgroups of patients. If the effect of the subgroup is not well
modeled, e.g., an interaction was missed, this might be reflected in this assessment.
There are, however, more direct ways of assessing the influence of subgroup
characteristics, as was discussed in Chap. 13 on model specification. So, this check
for calibration is also more for face validity of the model and for convincing
potential users than a serious check of calibration. Measures for assessment of
calibration are summarized in Table 15.9.

## Kinds of Calibration

| Type              | Definition | Test |
| :---------------- | :------: | ----: |
| Mean        |  $\widehat{\mathbb{E}}[p] = \dfrac{1}{N} \sum_{i=1}^N \hat{p}(x_i)$   | Test $\hat{a}=0$ assuming that $\hat{b}=1$ |
| Weak           |   No systematic mis-estimation of risks  | Estimate $(a,b)$, test whether $\hat{a}=0$ and $\hat{b}=1$ |
| Moderate |  $\widehat{\mathbb{E}}[p \mid \hat{p}] = \hat{p}$   | Calibration curve |
| Strong |  $\widehat{\mathbb{E}}[p(x_i) \mid \hat{p}(x_i), x_i] = \hat{p}(x_i)$   | Goodness-of-fit assessment |

[602]
[26]
[225]
[109,627]
[114] - Calibration slope
[257] - Hosmer–Lemeshow GOF test, [225] Harrell's E statistic, Estimated Calibration Index (ECI) [621]
[602] Utopic model
[187, 324] Goeman–Le Cessie GOF test: An interesting approach is the Goeman–Le Cessie goodness-of-fit test [187, 324].
It assesses the alternative hypothesis that any nonlinearities or interaction
effects have been missed in a logistic regression model. Such neglected effects can
be detected by studying patterns in the residuals: observations close to each other in
covariate space which deviate from the model in the same direction. The approach
is to smooth the regression residuals and to test whether these smoothed residuals
have more variance than expected under the null hypothesis. This deviation occurs
when residuals that are close together in the covariate space are correlated. The test
statistic is a sum of squared smoothed residuals.
[116] Survival Analysis: A calibration plot can also be produced. The calibration of a model can be
studied at fixed time points. We can group patients for calculation of survival rates
with the Kaplan–Meier method. Harrell suggests using at least 50 subjects per
group, depending on the hazard of the outcome [225]. This observed survival may
be compared to the mean predicted survival from the prediction model. A smoothed
calibration curve can be obtained by comparing Cox–Snell residuals on the
cumulative probability scale against the right-censored survival times [225]. We can
also plot the observed $t$-year risk of the outcome for each tenth of patients (and 95%
confidence intervals) against the predicted risk estimated from the Poisson
regression model [116]. This model-based approach can be extended to replace the
groups with splines. These approaches depend on the baseline hazard being
available for at least some specific time points [471].

[568] Calibration plot -> Validation plot

[629] The calibration slope, however, has
a direct mathematical relation with discrimination [629]. If the calibration slope is
below unity, the discrimination is also lower at external validation. Hence,
overfitted models will show both poor calibration and poor discrimination when
validated in new patients (Chap. 19).

[432] Predictiveness curves


The framework of a recalibration model was already proposed by Cox [114], and
has been supported by many other researchers for evaluation of model performance
[109, 225, 379, 380, 626]. Nice illustrations of diagnostic test evaluation with ROC
curves are available at: http://www.anaesthetist.com/mnm/stats/roc/ and
illustrations of Lorenz curves and the Gini index are at:
http://en.wikipedia.org/wiki/Gini_coefficient.






## Discrimination

Nagelkerke’s $R^2$ can well be used [403], although
many alternatives are available, and some may prefer other definitions [7].



We consider a development sample containing 544 patients [551], and a validation
sample of 273 patients treated at Indiana University Medical Center [644]. We
developed a logistic regression model with five predictors: teratoma elements in the
primary tumor, prechemotherapy levels of AFP and HCG, postchemotherapy mass
size, and reduction in mass size.





For example, the Brier score can formally be decomposed into
indicators of calibration and discrimination [54, 396].
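One well-known decomposition of this kind splits the Brier score into reliability (a calibration term), resolution (a discrimination-like term), and uncertainty, with Brier = reliability - resolution + uncertainty when subjects are grouped by distinct predicted values. Whether references [54, 396] use exactly this form is an assumption; the function name and toy data below are hypothetical.

```python
from collections import defaultdict

def brier_decomposition(p_hat, d):
    """Return (brier, reliability, resolution, uncertainty), grouping by distinct p_hat."""
    n = len(d)
    brier = sum((p - y) ** 2 for p, y in zip(p_hat, d)) / n

    # Collect outcomes for each distinct predicted value.
    groups = defaultdict(list)
    for p, y in zip(p_hat, d):
        groups[p].append(y)

    base_rate = sum(d) / n
    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return brier, reliability, resolution, uncertainty

# Toy data (hypothetical): two risk groups, the low-risk group is under-predicted.
p_hat = [0.2] * 5 + [0.8] * 5
d = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

brier, rel, res, unc = brier_decomposition(p_hat, d)
print(brier, rel, res, unc)  # about 0.22, 0.02, 0.04, 0.24
```

Lower reliability means better calibration, while higher resolution means the predicted groups differ more in their observed event rates.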




The area under the curve (AUC) can be interpreted as the probability that a
patient with the outcome is given a higher probability of the outcome by the model
than a randomly chosen patient without the outcome [223]. An uninformative
model, such as a coin flip, will hence have an area of 0.5. A perfect model has an
area of 1. The AUC is usually the most important number from a ROC plot; the plot
itself suffers from instability and is rather meaningless if no thresholds are indicated
(Fig. 15.3, no thresholds, only the area is relevant versus Fig. 15.2, thresholds
added).
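The probabilistic interpretation of the AUC translates directly into a pairwise count. The sketch below compares every subject with the outcome against every subject without it; counting ties as one half is a common convention that is assumed here, and the function name and toy data are hypothetical.

```python
def c_statistic(p_hat, d):
    """Share of (outcome, no-outcome) pairs where the subject with the outcome ranks higher."""
    with_outcome = [p for p, y in zip(p_hat, d) if y == 1]
    without_outcome = [p for p, y in zip(p_hat, d) if y == 0]
    concordant = 0.0
    for p1 in with_outcome:
        for p0 in without_outcome:
            if p1 > p0:
                concordant += 1.0
            elif p1 == p0:
                concordant += 0.5  # ties count as half (assumed convention)
    return concordant / (len(with_outcome) * len(without_outcome))

# Toy data (hypothetical).
p_hat = [0.1, 0.4, 0.35, 0.8]
d = [0, 0, 1, 1]
print(c_statistic(p_hat, d))  # 0.75: three of the four pairs are concordant
```

The double loop is quadratic in the sample size, which is fine for illustration; rank-based formulas give the same number more efficiently.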






The Lorenz curve has been used in economics to characterize the
distribution of wealth in a population [351].

The discrimination slope is a simple measure for how well subjects with and
without the outcome are separated. Its use as a measure for discrimination is
attributed to Yates [684]. It is easily calculated as the absolute difference in average
predictions for those with and without the outcome. It can be visualized with box plots
of the predictions conditional on the actual outcome (0 or 1).
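A minimal sketch of the discrimination slope, computed as the absolute difference in mean predictions between the two outcome groups. The function name and toy data are hypothetical.

```python
def discrimination_slope(p_hat, d):
    """Absolute difference in mean predicted risk between subjects with and without the outcome."""
    with_outcome = [p for p, y in zip(p_hat, d) if y == 1]
    without_outcome = [p for p, y in zip(p_hat, d) if y == 0]
    mean_with = sum(with_outcome) / len(with_outcome)
    mean_without = sum(without_outcome) / len(without_outcome)
    return abs(mean_with - mean_without)

# Toy data (hypothetical).
p_hat = [0.1, 0.4, 0.35, 0.8]
d = [0, 0, 1, 1]
print(discrimination_slope(p_hat, d))  # about 0.325 (0.575 versus 0.25)
```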



Indeed, the interesting connection is that Pearson $R^2$ is asymptotically equal to
the Yates slope. Improvements in Pearson $R^2$ or in Yates slope are equivalent to the
integrated discrimination index (IDI) [426, 428, 568].



GINI:
For prediction models, we can plot the cumulative proportion of the population
on the x-axis, ranked by predicted probability. On the y-axis, we plot the cumulative
proportion of subjects with the outcome. For example, we can show the
proportion of subjects developing cancer against the cumulative proportion of the
population ranked by cancer risk [46]. In terms of ROC curves, we plot the
cumulative rate of false-negative classifications against the cumulative rate of
negative predictions. The ROC and Lorenz curves look somewhat similar, except
that the Lorenz curve is flipped vertically and horizontally. In case of a
non-informative model, a straight line arises, since every rate of the population
classified as negative corresponds to the same rate classified as negative among
those with the outcome. A good model has a curve under this straight line, with a
relatively large proportion of the population classified as negative having only a
small part of the outcomes (low false-negative rate). On the upper end of the x-axis,
a small part of the population should contain many subjects with the outcome. In
the ideal case, a cutoff is used that classifies the fraction as positive equal to the
prevalence, and all these have the outcome. Indeed, we note that a c statistic of 0.98
leads to a nearly horizontal line till the 50% cumulative proportion point on the
x-axis and increases more or less linearly to 100% after that.
The Gini index is sometimes calculated as a summary measure for the Lorenz
curve. The Gini index is the ratio between the area $A$ between the Lorenz curve of
the prediction model and the line for a non-informative model and the area under
the line for a non-informative model (0.5). Hence, $G = A / 0.5 = 2A$.
Other summaries are related to quantiles of the cumulative
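The Lorenz curve construction and the Gini index described above can be sketched as follows. The trapezoid rule for the area, the handling of tied predictions (kept in input order), and the function names are assumptions made for this illustration.

```python
def lorenz_curve(p_hat, d):
    """Cumulative population share (x) versus cumulative outcome share (y), ranked by p_hat."""
    order = sorted(range(len(d)), key=lambda i: p_hat[i])  # ascending predicted risk
    total_events = sum(d)
    xs, ys = [0.0], [0.0]
    cumulative_events = 0
    for rank, i in enumerate(order, start=1):
        cumulative_events += d[i]
        xs.append(rank / len(d))
        ys.append(cumulative_events / total_events)
    return xs, ys

def gini_index(p_hat, d):
    """G = 2 * A, where A is the area between the diagonal and the Lorenz curve."""
    xs, ys = lorenz_curve(p_hat, d)
    # Trapezoid rule for the area under the Lorenz curve.
    area_under = sum(
        (xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0 for k in range(1, len(xs))
    )
    area_between = 0.5 - area_under
    return 2.0 * area_between

# Toy data (hypothetical).
p_hat = [0.1, 0.4, 0.35, 0.8]
d = [0, 0, 1, 1]
print(lorenz_curve(p_hat, d))  # ([0.0, 0.25, 0.5, 0.75, 1.0], [0.0, 0.0, 0.5, 0.5, 1.0])
print(gini_index(p_hat, d))    # 0.25
```

A curve further below the diagonal gives a larger $A$ and hence a larger Gini index.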




Table 15.5 Summary of some measures for discriminative ability of a prediction model for
binary outcomes


## Bootstrap Validation

For bootstrap validation a prediction model is developed in each bootstrap
sample. This model is evaluated both in the bootstrap sample and in the original
sample. The first reflects apparent validation, the second reflects validation in new
subjects. The difference in performance indicates the optimism. This optimism is
subtracted from the apparent performance of the original model in the original
sample [148, 225, 542, 547]. The bootstrap was illustrated for estimation of
optimism in Chap. 5.
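The optimism correction described above can be sketched with a deliberately simple setup. Everything specific here is an assumption for illustration: the model is a one-predictor least-squares line, performance is $R^2$, and the data are simulated; the same loop applies to any model and performance measure.

```python
import random

def fit_line(x, y):
    """Least-squares line y = a + b * x; returns (a, b)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def r_squared(model, x, y):
    """R^2 of a fixed model (a, b) evaluated on the data (x, y)."""
    a, b = model
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1.0 - sse / sst

def bootstrap_validation(x, y, n_boot=200, seed=0):
    """Return (apparent, optimism, optimism-corrected) performance."""
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(fit_line(x, y), x, y)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        model = fit_line(xb, yb)  # develop the model in the bootstrap sample
        # Performance in the bootstrap sample minus performance in the original sample.
        optimism += r_squared(model, xb, yb) - r_squared(model, x, y)
    optimism /= n_boot
    return apparent, optimism, apparent - optimism

# Simulated data (hypothetical).
data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(40)]
y = [0.5 * xi + data_rng.gauss(0.0, 1.0) for xi in x]

apparent, optimism, corrected = bootstrap_validation(x, y)
print(apparent, optimism, corrected)
```

With a single predictor the optimism is expected to be small; it tends to grow as more predictors are considered relative to the sample size.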


