\documentclass{article}
\def\ParSkip{}
\input{../../common/ryantibs}
\usepackage[normalem]{ulem}
\usepackage{animate}
\title{Forecast Scoring and Calibration \\ \smallskip
\large Advanced Topics in Statistical Learning, Spring 2023 \\ \smallskip
Ryan Tibshirani }
\date{}
\begin{document}
\maketitle
\RaggedRight
\vspace{-50pt}
\section{Introduction}
In this lecture, we cover calibration, a topic that in a sense sits adjacent to
conformal prediction. We will cover the perspective(s) on calibration
from the forecasting literature, where it is arguably the most developed.
Instead of focusing solely on calibration, we'll cover some of the broader
theory of forecasting, pertaining to scoring rules, calibration, and ensembles.
There is a rich literature on forecast scoring and calibration. This literature
has roots in statistics, but over the years it has been expanded and driven by
applied forecasting disciplines, primarily weather forecasting. It is still
quite an active area in terms of development: new results that seem foundational
in nature are still being discovered year to year.
As per our usual comment, what we cover in this lecture is only a sample of what
is known in the theory of forecasting. Unlike other topics, we are not aware of
a book that gives a modern and comprehensive treatment of forecast scoring
rules, calibration, and so on. (Perhaps this would be premature, as these topics
are still in active development, and such a book will come later.) One of the
most active and influential researchers in forecasting theory today is Tilmann
Gneiting, and much of what we cover will be drawing from his work with
collaborators, such as \citet{gneiting2007strictly, gneiting2007probabilistic,
ranjan2010combining, gneiting2013combining}.
\subsection{Probabilistic forecasting}
Our setup in this lecture will be that of \emph{probabilistic} (as opposed to
\emph{point}) forecasting. In this setting, a forecaster outputs a predicted
distribution $P$ of a given target $Y$ of interest. We will generally focus on
the case in which $Y$ is real-valued. In probabilistic forecasting, both $P$ and
$Y$ are random variables. To help give context, we can think of an underlying
sequential problem indexed by time $t=1,2,3,\dots$, where:
\[
\text{a forecaster predicts $P_t$ for the distribution of $Y_t$, based on
information available at times $s<t$}.
\]
This sequential problem will be ``lurking in the background'' throughout, but
often we will hide the time dependence because it will be more convenient (and
just as general) to talk about a single random forecaster $P$ and random target
$Y$.
\subsection{Probabilities or quantiles?}
It is most common to represent a probabilistic forecast $P$ by its cumulative
distribution function (CDF) $F$. Of course, instead of representing $P$ by its
CDF $F$, we can choose to represent it by its quantile function
$F^{-1}$. To be clear, the quantile function $F^{-1}$ is always well-defined,
even when $F$ is not continuous, or injective, and it is defined precisely as
\[
F^{-1}(\tau) = \inf\{ x : F(x) \geq \tau\}.
\]
When $F$ is not continuous, this shares some but not all properties of the
usual inverse; recall the properties you derived on the homework.
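To make the generalized inverse concrete, here is a minimal Python sketch (with a hypothetical pmf on $\{0,1,2\}$) that computes $F^{-1}(\tau) = \inf\{x : F(x) \geq \tau\}$ directly from its definition:

```python
def quantile(cdf_points, tau):
    # generalized inverse F^{-1}(tau) = inf{x : F(x) >= tau}, for a discrete
    # distribution given as increasing (x, F(x)) pairs
    for x, Fx in cdf_points:
        if Fx >= tau:
            return x
    raise ValueError("tau exceeds the CDF's maximum")

# hypothetical pmf on {0, 1, 2}: P(0) = 0.5, P(1) = 0.3, P(2) = 0.2
cdf = [(0, 0.5), (1, 0.8), (2, 1.0)]
assert quantile(cdf, 0.50) == 0   # F(0) = 0.5 attains tau = 0.5 exactly
assert quantile(cdf, 0.51) == 1   # "exact" quantiles exist only at coarse levels
assert quantile(cdf, 0.95) == 2
```

Note how, for such a discrete distribution, the quantile function is piecewise constant, which is the phenomenon behind the caveat about count-valued responses above.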
An important note: typically these objects, the CDF $F$ or quantile function
$F^{-1}$, will be discretized in practice, and \emph{converting from one to the
other will not always be possible/practical, particularly in the tails}. Thus
whether we choose at the outset to represent a predicted distribution by a CDF
or quantile function can have nontrivial implications on what tools we have at
our disposal for downstream tasks, like evaluation and aggregation.
Later, in Section \ref{sec:prob_quant}, we will discuss the differences between
aggregating probabilities (CDFs) and aggregating quantiles, in detail. Here, we
provide some higher-level practical considerations to keep in mind when choosing
to represent forecasts via probabilities versus quantiles.
\begin{itemize}
\item A quantile representation for forecasts alleviates concerns about choosing
the ``right'' bins for probability forecasts. Choosing bins in practice can be
highly nonobvious, both in terms of the ``right'' scale, and the ``right''
range.
\item Quantiles can be directly fit in a straightforward way, via quantile
regression, and this can be carried out jointly over multiple probability
levels. Constructing flexible (conditional) probability distributions can
often be a more complex modeling task.
\item For a discrete response that is concentrated around low values (such as
counts), a quantile representation may be undesirable since ``exact''
quantiles only exist at a coarse set of probability levels (e.g., a reported
level 0.05 quantile may actually be a level 0.09 quantile).
\end{itemize}
\section{Scoring rules}
We now turn to discussing scoring rules, for evaluating probabilistic
forecasts. A \emph{scoring rule} (or simply a score) is a function $S$ that maps
a predicted distribution $P$ and target value $y$ to a real or infinite number
\smash{$S(P, y) \in \widebar\R$} (where \smash{$\widebar\R = \R \cup \{\pm
\infty\}$} denotes the extended reals). We will, by convention, always take
the score to be negatively-oriented: a lower value means a better forecast.
Thus to be clear, in practice (in the sequential setting): we would compare two
forecasters \smash{$P_t, \tilde{P}_t$} over times $t=1,2,3,\dots$ by comparing
\[
\frac{1}{T} \sum_{t=1}^T S(P_t, Y_t) \quad \text{and} \quad
\frac{1}{T} \sum_{t=1}^T S(\tilde{P}_t, Y_t),
\]
with a lower average score suggesting a better forecaster.
In general, we use the notation $S(P, Q)$ to denote the expectation of the score
over draws $Y \sim Q$,
\[
S(P, Q) = \E_{Y \sim Q} [S(P, Y)] = \int S(P, y) \, dQ(y).
\]
\subsection{Proper scores}
A scoring rule $S$ is said to be \emph{proper} if
\begin{equation}
\label{eq:proper}
S(P, Q) \geq S(Q, Q), \quad \text{for all $P,Q$},
\end{equation}
and \emph{strictly proper} if strict inequality holds in \eqref{eq:proper} for
$P \not= Q$. To quote \citet{gneiting2007strictly}, when $Q$ denotes the
distribution of $Y$ and the score $S$ is proper: ``the forecaster has no
incentive to predict any $P \not= Q$ and is encouraged to report his or her true
belief''.
Several examples of proper scores are given below. (An example of an intuitively
appealing but improper score will also be given.) Throughout, we will flexibly
write the scores as taking densities, CDFs, or quantile functions as inputs
(instead of distributions), as convenient.
\paragraph{Log score.}
\def\KL{\mathrm{KL}}
For a forecast that has density or probability mass function $p$, \emph{log
score} is defined by
\begin{equation}
\label{eq:logs}
\mathrm{LogS}(p, y) = -\log p(y).
\end{equation}
This is a strictly proper score. To see this, we compute:
\begin{equation}
\label{eq:kl_div}
\mathrm{LogS}(p, q) - \mathrm{LogS}(q, q) = \int \log \frac{q(y)}{p(y)} q(y) \,
dy,
\end{equation}
which recall is the Kullback-Leibler (KL) divergence between $q$ and $p$, often
denoted $\KL(q,p)$ (note carefully the order of arguments here). This is known
to be nonnegative, and positive for $p \not= q$.
Log score \eqref{eq:logs} very sharply penalizes forecasts that place low
probability on events that materialize: for small $p(y)$, the score $-\log p(y)$
is very large (and diverges as $p(y) \to 0$).
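As a quick numerical sanity check of propriety (in Python, with hypothetical pmfs $p$ and $q$ on three outcomes), the expected-score gap for log score equals the KL divergence, which is positive here:

```python
import math

def log_score(p, y):
    # negatively-oriented log score: -log p(y), for a pmf given as a list
    return -math.log(p[y])

def expected_score(score, p, q):
    # S(p, q) = E_{Y~q}[S(p, Y)] over the outcomes {0, ..., k-1}
    return sum(q[y] * score(p, y) for y in range(len(q)))

def kl(q, p):
    # KL(q, p) = sum_y q(y) log(q(y) / p(y))
    return sum(qy * math.log(qy / py) for qy, py in zip(q, p) if qy > 0)

p = [0.2, 0.5, 0.3]  # hypothetical forecast pmf
q = [0.4, 0.4, 0.2]  # hypothetical true pmf

gap = expected_score(log_score, p, q) - expected_score(log_score, q, q)
assert gap > 0 and abs(gap - kl(q, p)) < 1e-12  # gap equals KL(q, p) > 0
```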
\paragraph{Interlude: connection to Bregman divergence.}
KL divergence is a special instance of a \emph{Bregman divergence}, which is of
the form (note carefully the order of arguments)
\begin{equation}
\label{eq:bregman}
d_\phi(q,p) = \phi(q) - \phi(p) - \langle D\phi(p), q - p \rangle,
\end{equation}
for a convex differentiable function $\phi$, with derivative $D\phi$, and
suitable inner product $\langle \cdot, \cdot \rangle$. In particular, KL
divergence \eqref{eq:kl_div} is a Bregman divergence \eqref{eq:bregman} when
$\phi(p) = \int p(y) \log p(y) \, dy$, and $\langle p, q \rangle = \int p(y)
q(y) \, dy$.
The connection between log score and Bregman divergence is no coincidence, and
the connection between proper scores and Bregman divergences runs much
deeper. If $\phi$ is strictly convex, then by the first-order characterization
for convexity, we have $d_\phi(q,p) \geq 0$, with strict inequality for $p \not=
q$. Due to this fact, if $S$ is any score for which we have the Bregman
representation (for strictly convex $\phi$)
\begin{equation}
\label{eq:bregman_rep}
S(p, q) - S(q, q) = d_\phi(q, p),
\end{equation}
then $S$ is strictly proper. In essence, the converse is true as well. The
formal details will be covered in Section \ref{sec:bregman_rep}.
\paragraph{Quadratic score.}
For a forecast that has density or probability mass function $p$,
\emph{quadratic} or \emph{Brier score} is defined by
\begin{equation}
\label{eq:quads}
\mathrm{QuadS}(p, y) = -2p(y) + \|p\|_2^2,
\end{equation}
where $\|p\|_2^2 = \int p(y)^2 \, dy$. This is a strictly proper score. To see
this, we compute:
\[
S(p, q) - S(q, q) = \|p\|_2^2 - 2 \int p(y) q(y) \, dy + \|q\|_2^2 =
\|p-q\|_2^2,
\]
which is the $L^2$ distance between $p$ and $q$. It is nonnegative, and
positive for $p \not= q$.
Quadratic score \eqref{eq:quads} is more robust than log score \eqref{eq:logs},
in the sense that it is gentler in penalizing forecasts that place low
probability on events that materialize.
\paragraph{Linear score.}
For a forecast that has density or probability mass function $p$,
\emph{linear score} is defined by
\[
\mathrm{LinS}(p, y) = -p(y).
\]
We mention this score because it may seem intuitively appealing, but is
\emph{not proper} (and is basically never used in practice). To see this, let
$q$ be the standard normal density, and let $p$ be the uniform density on the
interval $(-\epsilon, \epsilon)$, for some $\epsilon>0$ to be specified
later. Then
\begin{align*}
\mathrm{LinS}(p, q) - \mathrm{LinS}(q, q)
&= \int q(y)^2 \, dy - \frac{1}{2\epsilon}
\int_{-\epsilon}^\epsilon q(y) \, dy \\
&= \frac{1}{2\pi} \int e^{-y^2} \, dy - \frac{1}{\sqrt{2\pi}}
\frac{1}{2\epsilon} \int_{-\epsilon}^\epsilon e^{-y^2/2} \, dy \\
&= \frac{1}{\sqrt{2\pi}} \bigg( 1/\sqrt{2} - \frac{1}{2\epsilon}
\int_{-\epsilon}^\epsilon e^{-y^2/2} \, dy \bigg).
\end{align*}
The last display will be negative for small enough $\epsilon > 0$: in
particular, for \smash{$\epsilon < \sqrt{\log 2}$}, the integrand in the above
satisfies \smash{$e^{-y^2/2} > 1/\sqrt{2}$}, so
\[
\mathrm{LinS}(p, q) - \mathrm{LinS}(q, q) <
\frac{1}{\sqrt{2\pi}} \bigg( 1/\sqrt{2} - \frac{1}{2\epsilon}
\int_{-\epsilon}^\epsilon 1/\sqrt{2} \, dy \bigg) = 0.
\]
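The calculation above is easy to check numerically; the following sketch (with $\epsilon = 0.5$, and a midpoint Riemann sum standing in for the integral) shows the expected-score gap is negative, confirming improperness:

```python
import math

def normal_pdf(y):
    # standard normal density q(y)
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

# E_{Y~q}[LinS(q, Y)] = -int q(y)^2 dy = -1/(2 sqrt(pi)) for the standard normal
lin_q_q = -1.0 / (2.0 * math.sqrt(math.pi))

# E_{Y~q}[LinS(p, Y)] for p = Unif(-eps, eps) is -(1/(2 eps)) int_{-eps}^{eps} q(y) dy,
# approximated here with a midpoint Riemann sum
eps = 0.5   # any eps < sqrt(log 2) works
n = 20000
dx = 2 * eps / n
lin_p_q = -(1.0 / (2 * eps)) * sum(
    normal_pdf(-eps + (i + 0.5) * dx) for i in range(n)) * dx

gap = lin_p_q - lin_q_q
assert gap < 0  # the "wrong" forecast p beats q in expectation: not proper
```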
\paragraph{Continuous ranked probability score.}
Not all probabilistic forecasts can be expressed in terms of a density or
probability mass function. For example, some forecast distributions might place
a point mass at zero, and would be otherwise continuous. The \emph{continuous
ranked probability score} (CRPS) accommodates this and, in general, any
forecast expressed in terms of a predicted CDF $F$. It is defined by
\begin{equation}
\label{eq:crps}
\mathrm{CRPS}(F, y) = \int (F(x) - 1\{y \leq x\})^2 \, dx.
\end{equation}
This is a strictly proper score. To see this, we compute:
\begin{align*}
\mathrm{CRPS}(F, G) - \mathrm{CRPS}(G ,G)
&= \int \Big( F(x)^2 - G(x)^2 - 2(F(x) - G(x)) \E_{Y \sim G}[ 1\{Y \leq x\} ]
\Big) \, dx \\
&= \int \Big( F(x)^2 - G(x)^2 - 2(F(x) - G(x)) G(x) \Big) \, dx \\
&= \int (F(x) - G(x))^2 \, dx.
\end{align*}
This is the Cram{\'e}r-von Mises distance between $F$ and $G$. It is
nonnegative, and positive for $F \not= G$.
CRPS \eqref{eq:crps} has recently become quite popular in various applied
forecasting communities (e.g., in the meteorological sciences), in part because it is
considered more robust than log score, and as already mentioned, it is more
broadly applicable, beyond log \eqref{eq:logs} and quadratic \eqref{eq:quads}
scores. However, unlike the latter two, CRPS is nontrivial to compute. Some
authors have pursued quadrature-based approximations to the integral defining
CRPS. A simpler approximation is given by noting its connection to weighted
interval score/quantile loss, which will be covered just a bit later on.
\paragraph{Interval score.}
For an interval forecast expressed as $[\ell_\alpha, u_\alpha]$, where the lower
and upper endpoints $\ell_\alpha = F^{-1}(\alpha/2)$ and $u_\alpha =
F^{-1}(1-\alpha/2)$ are the predicted $\alpha/2$ and $1-\alpha/2$ quantiles of
the forecast distribution, \emph{interval score} is defined by
\[
\mathrm{IS}_\alpha( [\ell_\alpha, u_\alpha], y) =
(u_\alpha - \ell_\alpha) + \frac{2}{\alpha} \cdot
\mathrm{dist}(y, [\ell_\alpha, u_\alpha])
\]
where \smash{$\mathrm{dist}(y, S) = \inf_{x \in S} |x-y|$} for a set $S$. We can
see that this combines a reward for sharpness (first term) and a penalty for
miscoverage (second term). For a collection of interval forecasts
$[\ell_\alpha, u_\alpha]$, $\alpha \in \cA$, \emph{weighted interval score}
(WIS) is defined by
\begin{align}
\nonumber
\mathrm{WIS}_\cA \big( \{[\ell_\alpha, u_\alpha]\}_{\alpha \in \cA}, \, y \big)
&= \sum_{\alpha \in \cA} \alpha \cdot \mathrm{IS}_\alpha( [\ell_\alpha,
u_\alpha], y) \\
\label{eq:wis}
&=\sum_{\alpha \in \cA} \Big(\alpha (u_\alpha - \ell_\alpha) + 2 \cdot
\mathrm{dist}(y, [\ell_\alpha, u_\alpha]) \Big).
\end{align}
Weighted interval score (and thus interval score, as effectively a special case)
is a proper score for predicting the quantiles at levels $\alpha/2$ and
$1-\alpha/2$, over $\alpha \in \cA$. To be clear, here we say (by analogy to
\eqref{eq:proper}) that a score $S$ defined in terms of a discrete set of
predicted quantiles \smash{$\{q_\tau\}_{\tau \in \cT}$} is proper if it holds
that
\begin{equation}
\label{eq:proper_q}
S \big( \{q_\tau\}_{\tau \in \cT}, \, F \big) \geq S \big(
\{F^{-1}(\tau)\}_{\tau \in \cT}, \, F \big), \quad \text{for all
$\{q_\tau\}_{\tau \in \cT}$, and all $F$},
\end{equation}
where as usual \smash{$S(\{q_\tau\}_{\tau \in \cT}, \, F) = \E_{Y \sim F}[
S(\{q_\tau\}_{\tau \in \cT}, \, Y)]$}. Propriety of weighted interval score
can be verified by drawing a connection to quantile loss, covered shortly.
\paragraph{Quantile score.}
For a forecast expressed as a collection of predicted quantiles $q_\tau$, $\tau
\in \cT$, \emph{quantile score} (QS) is defined by
\begin{equation}
\label{eq:qs}
\mathrm{QS}_\cT \big( \{q_\tau\}_{\tau \in \cT}, \, y \big) =
\sum_{\tau \in \cT} \rho_\tau(y - q_\tau),
\end{equation}
where $\rho_\tau(u) = \tau |u|$ for $u \geq 0$ and $(1-\tau) |u|$ for $u < 0$,
often called the tilted $\ell_1$ (or pinball) loss at the level $\tau$. This is
precisely the loss that is the standard choice in \emph{quantile regression}. That
quantile score is proper, in the sense of \eqref{eq:proper_q}, follows from the
standard fact that the true level $\tau$ quantile is a minimizer of the expected
tilted $\ell_1$ loss at the level $\tau$.
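This minimizing property can be checked by simulation; in this sketch (sample size and candidate grid are arbitrary choices), the average tilted $\ell_1$ loss over a sample from $N(0,1)$ is minimized essentially at the empirical 0.9-quantile:

```python
import random

def pinball(y, q, tau):
    # tilted l1 (pinball) loss rho_tau(y - q)
    u = y - q
    return tau * u if u >= 0 else (tau - 1) * u

random.seed(0)
ys = [random.gauss(0, 1) for _ in range(5000)]
tau = 0.9

# scan candidate values q; the minimizer of the average pinball loss should
# land at (within grid resolution) the empirical 0.9-quantile of the sample
grid = [0.8 + i * 0.01 for i in range(101)]
best = min(grid, key=lambda q: sum(pinball(y, q, tau) for y in ys))
emp_q = sorted(ys)[int(tau * len(ys))]
assert abs(best - emp_q) < 0.05
```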
\paragraph{Connection between WIS and QS.}
The following is a connection between WIS \eqref{eq:wis} and QS \eqref{eq:qs}:
for any set $\cA$ of error levels, if we define a corresponding set
\smash{$\cT = \cup_{\alpha \in \cA} \, \{ \alpha/2, 1-\alpha/2 \}$} of
probability levels, and set $q_{\alpha/2} = \ell_\alpha$ and $q_{1-\alpha/2} =
u_\alpha$ for each $\alpha \in \cA$, then it holds that
\begin{equation}
\label{eq:wis_qs}
\mathrm{WIS}_\cA \big( \{[\ell_\alpha, u_\alpha]\}_{\alpha \in \cA}, \, y \big)
= 2 \cdot \mathrm{QS}_\cT \big( \{q_\tau\}_{\tau \in \cT}, \, y \big).
\end{equation}
The equivalence \eqref{eq:wis_qs} between WIS and QS can be verified by
straightforward algebra. Apart from being interesting, and lending a new
interpretation to how quantile score \eqref{eq:qs} combines a sharpness reward
with a miscoverage penalty (recall the discussion following \eqref{eq:wis}), it
is also a useful observation as it provides an indirect but simple proof that
WIS \eqref{eq:wis} is proper, in the sense of \eqref{eq:proper_q}.
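The identity \eqref{eq:wis_qs} is also easy to verify numerically; the sketch below (with hypothetical interval forecasts at error levels 0.2 and 0.5, and the matching quantiles) checks it at several values of $y$:

```python
def rho(tau, u):
    # tilted l1 loss
    return tau * u if u >= 0 else (tau - 1) * u

def dist(y, lo, hi):
    # distance from y to the interval [lo, hi]
    return max(lo - y, 0.0, y - hi)

def wis(intervals, y):
    # intervals maps alpha -> (l_alpha, u_alpha)
    return sum(a * (u - l) + 2 * dist(y, l, u)
               for a, (l, u) in intervals.items())

def qs(quantiles, y):
    # quantiles maps tau -> q_tau
    return sum(rho(tau, y - q) for tau, q in quantiles.items())

# hypothetical interval forecasts at error levels 0.2 and 0.5, with the
# corresponding quantiles at probability levels 0.1, 0.9 and 0.25, 0.75
intervals = {0.2: (-1.6, 1.6), 0.5: (-0.7, 0.7)}
quantiles = {0.1: -1.6, 0.9: 1.6, 0.25: -0.7, 0.75: 0.7}

for y in [-3.0, -0.5, 0.0, 1.0, 2.5]:
    assert abs(wis(intervals, y) - 2 * qs(quantiles, y)) < 1e-12
```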
\paragraph{Connection between CRPS and QS.}
The connections continue: interestingly, CRPS \eqref{eq:crps} also has an
equivalent formulation in terms of QS: if $F$ is differentiable, with density
$f$, and has a finite mean, $\int |y| f(y) \, dy < \infty$, then
\begin{equation}
\label{eq:crps_qs}
\int (F(x) - 1\{y \leq x\})^2 \, dx =
2 \int \rho_\tau(y - F^{-1}(\tau)) \, d\tau.
\end{equation}
The right-hand side is an integral of QS over all probability levels $\tau \in [0,1]$.
This is quite interesting because these two error metrics---CRPS and QS---are
motivated from very different perspectives, not to mention different
parametrizations (CDF space versus quantile space).
The equivalence in \eqref{eq:crps_qs} appears to have been discovered by
\citet{laio2007verification}; their argument uses integration by parts, but it
ignores a few subtleties, and so here we reproduce the proof from
\citet{fakoor2021flexible}, which basically just fills in a few missing details.
Starting with the right-hand side in \eqref{eq:crps_qs}, we can substitute
$x=F^{-1}(\tau)$ to rewrite the integral as
\[
2 \int (1\{y \leq F^{-1}(\tau)\} - \tau) (F^{-1}(\tau)-y) \, d\tau =
2 \int (1\{y \leq x\} - F(x)) (x-y) f(x) \, dx.
\]
Let $u'(x)=2(1\{y \leq x\} - F(x)) f(x)$ and $v(x)=x-y$. The idea is now to use
integration by parts, but there are two subtleties. First, one has to be careful
about forming the antiderivative $u$ of $u'$, since $x \mapsto 1\{y \leq x\}$ is
not classically differentiable. Note that we can actually redefine $u'$ to be
\[
u'(x) = 2(1\{y \leq x\} - F(x)) (f(x) - \delta_y(x)),
\]
where $\delta_y$ is the Dirac delta function centered at $y$, because the
``extra'' piece integrates to zero:
\[
\int 2(1\{y \leq x\} - F(x)) (x-y) \delta_y(x) \, dx =
2(1\{y \leq x\} - F(x)) (x-y) \Big|_{x=y} = 0.
\]
With this new definition of $u'$, its antiderivative is rigorously
\[
u(x) = -(1\{y \leq x\} - F(x))^2,
\]
because, in the distributional sense, the derivative of the Heaviside function
$x \mapsto 1\{y \leq x\}$ is indeed the delta function $\delta_y$. Thus we have
\begin{align*}
\int u'(x) v(x) \, dx
&= u(x) v(x) \Big|_{-\infty}^\infty - \int u(x) v'(x) \, dx \\
&= -(1\{y \leq x\} - F(x))^2 (x-y) \Big|_{-\infty}^\infty +
\int (1\{y \leq x\} - F(x))^2 \, dx.
\end{align*}
The second subtlety is to show that the first term above is indeed zero.
This is really a question of how fast the tails of $F$ decay. Since $F$ has a
finite mean, we must have $1-F(x) \lesssim x^{-1}$ (by Markov's
inequality). Hence
\[
(1-F(x))^2 x \lesssim x^{-1} \to 0 \quad \text{as $x \to \infty$},
\]
and the other limit, as $x \to -\infty$, is similar.
\paragraph{Quick recap.}
Here is a quick recap of what we learned on our tour of scores. Log score
\eqref{eq:logs} and quadratic score \eqref{eq:quads} are classical choices, each
strictly proper, with the former more aggressively penalizing low probability
assigned to events that materialize. CRPS \eqref{eq:crps} is a more modern
alternative, also strictly proper, which has become popular due to its
robustness and generality, but is nontrivial to compute.
For forecasts specified in terms of quantiles at discrete probability levels,
WIS \eqref{eq:wis} and QS \eqref{eq:qs} are each proper, simple, and popular
choices, motivated from different perspectives; and in fact they turn out to be
equivalent \eqref{eq:wis_qs}. Moreover, CRPS turns out to be equivalent
to an integrated quantile score \eqref{eq:crps_qs} over all probability
levels. Thus a simple approximation to CRPS in practice would be to discretize
at any desired set of probability levels and use this to define WIS/QS;
propriety is always maintained, in the sense of \eqref{eq:proper_q}, for any
choice of discretization.
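As an illustration of this discretization idea, the following sketch approximates CRPS for a standard normal forecast both by its defining integral and by a discretized (integrated) quantile score; the midpoint rules, grid sizes, and truncation range are arbitrary choices:

```python
from statistics import NormalDist

nd = NormalDist(0.0, 1.0)   # forecast CDF F = N(0, 1) (hypothetical choice)
y = 0.8                     # realized target value

# CRPS(F, y) = int (F(x) - 1{y <= x})^2 dx, via a midpoint sum on [-8, 8]
n, lo, hi = 100000, -8.0, 8.0
dx = (hi - lo) / n
crps = 0.0
for i in range(n):
    x = lo + (i + 0.5) * dx
    crps += (nd.cdf(x) - (1.0 if y <= x else 0.0)) ** 2 * dx

def rho(tau, u):
    # tilted l1 (pinball) loss
    return tau * u if u >= 0 else (tau - 1) * u

# 2 int_0^1 rho_tau(y - F^{-1}(tau)) dtau, again via a midpoint rule
m = 2000
qs_int = 2.0 * sum(rho((j + 0.5) / m, y - nd.inv_cdf((j + 0.5) / m))
                   for j in range(m)) / m

assert abs(crps - qs_int) < 1e-3  # the two agree, per the CRPS-QS identity
```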
Figures \ref{fig:scores1} and \ref{fig:scores2} visualize some of these scores
in action.
\begin{figure}[p]
\centering
\includegraphics[width=0.975\textwidth]{scores1.png}
\caption{\it Various scores visualized as functions of $y$, based on the
predicted distribution plotted in green. Here WIS$^*$ and WIS denote two
versions of weighted interval score at a coarser and finer set of probability
levels, respectively. Credit: \citet{bracher2021evaluating}.}
\label{fig:scores1}
\bigskip\medskip
\includegraphics[width=0.975\textwidth]{scores2.png}
\caption{\it Comparison between log score and WIS for two predicted
distributions: $F$ in green and $G$ in red. Note that $G$ has a higher
expectation, while $F$ is more dispersed. If we consider an event $y$ that is
marked as a dashed vertical line, then we see log score prefers $F$ to $G$,
while WIS prefers $G$ to $F$. Credit: \citet{bracher2021evaluating}.}
\label{fig:scores2}
\end{figure}
\subsection{Bregman representation}
\label{sec:bregman_rep}
Below we state an important result from \citet{gneiting2007strictly} that shows
that the Bregman representation \eqref{eq:bregman_rep} which we saw was possible
for log score (and quadratic score, and CRPS) is ``no accident'', and in a
precise sense, characterizes \emph{all} proper scores.
Before stating the result, we must introduce some terminology and
concepts. Recall that by convention, we take all scores to be
negatively-oriented. First, we say a score $S$ is \emph{regular} if $S(P,P)$ is
real-valued for all $P$ (i.e., we allow $S(P, Q) = \infty$ for $P \not=
Q$). Regularity is essentially needed in order for subderivatives (defined
shortly) to make sense but we omit the details.
Next, we refine the definition of propriety: for a class $\cP$ of probability
distributions, we say a score $S$ is \emph{proper relative to $\cP$} if
\eqref{eq:proper} holds for $P,Q \in \cP$, with \emph{strictly proper relative
to $\cP$} again meaning that the inequality is strict for $P \not= Q$.
Lastly, for a function $\phi$ acting over a set of distributions $\cP$, with
each $P \in \cP$ having a sample space $\cY$, we say that $D\phi(P, \cdot)$ is a
\emph{subderivative} of $\phi$ at $P \in \cP$ provided that
\begin{equation}
\label{eq:subderiv}
\phi(Q) \geq \phi(P) + \int D\phi(P, y) \, d(Q-P)(y), \quad \text{for all $Q \in
\cP$}.
\end{equation}
We are now ready to state the result.
\begin{theorem}
Let $\phi$ be a convex function acting over a set of distributions $\cP$. Define
for $P \in \cP$ and $y \in \cY$ the regular score
\begin{equation}
\label{eq:phi_score}
S(P, y) = -\phi(P) - D\phi(P, y) + \int D\phi(P, z) \, dP(z),
\end{equation}
where $D\phi(P, \cdot)$ is a subderivative of $\phi$ at $P$. Then for any $P,Q
\in \cP$,
\begin{equation}
\label{eq:phi_div}
S(P, Q) - S(Q, Q) = \underbrace{\phi(Q) - \phi(P) - \int D\phi(P, y) \,
d(Q-P)(y)}_{d_\phi(Q,P)},
\end{equation}
which is the Bregman divergence with respect to $\phi$, and the score $S$ is
proper relative to $\cP$.
Conversely, if $S$ is a regular score that is proper relative to $\cP$, then $S$
can be written in the form \eqref{eq:phi_score} with respect to the convex
function $\phi(P) = -S(P, P)$, and the Bregman representation \eqref{eq:phi_div}
holds.
Finally, the above equivalence also holds when the terms convex and proper are
replaced by strictly convex and strictly proper, respectively.
\end{theorem}
\begin{proof}
If $\phi$ is convex and we define $S$ according to \eqref{eq:phi_score}, then
direct calculation leads to $S(P,Q) - S(Q,Q) = -\phi(P) - \int D\phi(P, y) \,
dQ(y) + \int D\phi(P, y) \, dP(y) + \phi(Q) + \int D\phi(Q, y) \, dQ(y) - \int
D\phi(Q, y) \, dQ(y)$ and the last two terms cancel, which establishes
\eqref{eq:phi_div}. Meanwhile, $d_\phi(Q, P) \geq 0$ for any $P,Q \in \cP$ by
definition of the subderivative \eqref{eq:subderiv}, which proves that $S$ is
proper relative to $\cP$.
For the other direction, if $S$ is regular and proper relative to $\cP$, then
letting $\phi(Q) = -S(Q, Q)$ is equivalent to letting
\[
\phi(Q) = -\inf_{P \in \cP} \, S(P, Q) = \sup_{P \in \cP} \, -S(P, Q).
\]
For fixed $P$, the function $Q \mapsto -S(P, Q)$ is linear, and hence convex,
and thus by the above $\phi$ is a pointwise supremum of convex functions and
hence itself convex. Furthermore, we can see that $D\phi(P, \cdot) = -S(P,
\cdot)$ is a valid subderivative of $\phi$, since \eqref{eq:subderiv} becomes
$-S(Q, Q) \geq -S(P, Q)$ for all $Q \in \cP$, which is satisfied because $S$ is
proper relative to $\cP$. Under the choice $D\phi(P, \cdot) = -S(P, \cdot)$, the
score representation \eqref{eq:phi_score} is a tautology. The Bregman
representation \eqref{eq:phi_div} again follows by simple algebra.
This completes the proof of the claimed equivalence, and the arguments for the
strict version of the result follow similarly.
\end{proof}
Even though our focus has been (and will continue to be) real-valued forecasts,
it is worth emphasizing that the previous theorem treats the sample space $\cY$
as arbitrary. In the case of categorical forecasts, where $\cY = \{1,\dots,k\}$,
we get the following corollary, which is originally due to
\citet{savage1971elicitation}. We denote the standard $k$-dimensional
probability simplex by \smash{$\Delta^k = \{ p \in \R^k : p \geq 0, \;
\sum_{i=1}^k p_i = 1\}$}.
\begin{corollary}
Let $\cY = \{1,\dots,k\}$. Then a regular score $S$, which we treat as acting on
a vector of probabilities $p \in \Delta^k$, is proper with respect to the set of
all distributions on $\{1,\dots,k\}$ if and only if
\[
S(p, i) = -\phi(p) - D_i\phi(p) + \langle D\phi(p), p \rangle , \quad \text{for
$i=1,\dots,k$},
\]
for a convex function $\phi$ on $\Delta^k$, where $D\phi(p)$ denotes the
subgradient of $\phi$ at $p$ (with components $D_i\phi(p)$, $i=1,\dots,k$). The
same equivalence also holds when the terms convex and proper are replaced by
strictly convex and strictly proper, respectively.
\end{corollary}
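As a concrete instance of the corollary, taking $\phi(p) = \sum_{j=1}^k p_j^2$ (which is strictly convex, with $D\phi(p) = 2p$) recovers the quadratic/Brier score. A quick numerical check of this, with a hypothetical forecast vector $p$:

```python
def phi(p):
    # strictly convex phi(p) = sum_j p_j^2 on the simplex
    return sum(pj * pj for pj in p)

def score_from_phi(p, i):
    # S(p, i) = -phi(p) - D_i phi(p) + <D phi(p), p>, with D phi(p) = 2p
    grad = [2 * pj for pj in p]
    return -phi(p) - grad[i] + sum(g * pj for g, pj in zip(grad, p))

def brier(p, i):
    # quadratic/Brier score: -2 p_i + ||p||^2
    return -2 * p[i] + sum(pj * pj for pj in p)

p = [0.1, 0.6, 0.3]  # hypothetical categorical forecast
for i in range(3):
    assert abs(score_from_phi(p, i) - brier(p, i)) < 1e-12
```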
\section{Modes of calibration}
Next we turn to calibration. There are in fact many modes or ``flavors'' of
calibration, and they are related but not all equivalent. We will walk through
three such definitions, but first, we review some important preliminary
concepts, which will help motivate the definitions. (Side note: translating
calibration precisely to the sequential setting is a bit more tricky than it was
for scoring rules, since it requires more than just writing expectations as
averages; see \citet{gneiting2007probabilistic}.)
\subsection{Preliminary concepts}
\def\deq{\overset{d}{=}}
For any CDF $F$ (fixed or random) and target random variable $Y$, we define
\begin{equation}
\label{eq:pit}
F^*(Y) = V \cdot F(Y) + (1-V) \cdot F(Y^-),
\end{equation}
where $V \sim \mathrm{Unif}(0,1)$ is independent of $F$ and $Y$, and
\smash{$F(y^-) = \lim_{x \to y^-} F(x)$}. This is called the \emph{probability
integral transform} (PIT) associated with $F$ and $Y$. If $F$ is continuous
almost surely (i.e., $F$ is a continuous function with probability one), then
$F(Y^-) = F(Y)$ and the PIT reduces to $F^*(Y) = F(Y)$.
Recall that if we take $F$ to be \emph{fixed} and equal to the CDF of $Y$, then
$F^*(Y)$ (or $F(Y)$ in the continuous case) is itself standard uniform. You
proved this fact on the homework. We can write this simply as
\[
F^*(Y) \deq U,
\]
where here and throughout the remainder of these lecture notes, we use $U$ to
denote a standard uniform random variable, $U \sim
\mathrm{Unif}(0,1)$, independent of everything else that is random. Recall also
that if we take $F^{-1}$ to be \emph{fixed} and equal to the quantile function
of $Y$, then
\[
F^{-1}(U) \deq Y,
\]
which you also proved on the homework. The definitions of probabilistic and
marginal calibration, covered over the next two subsections, are based on
requiring the above two properties to hold, respectively, in the case when $F$
is also random.
\subsection{Probabilistic calibration}
A forecaster with predicted CDF $F$ is said to be \emph{probabilistically
calibrated} for a target $Y$ if
\begin{equation}
\label{eq:prob_calib}
F^*(Y) \deq U
\end{equation}
where recall $F^*(Y)$ is defined as in \eqref{eq:pit}, and $U \sim
\mathrm{Unif}(0,1)$. To be explicit, here both $F$ and $Y$ are
\emph{random}. Another name for probabilistic calibration is \emph{PIT
calibration}, and another way of writing \eqref{eq:prob_calib} is
\[
\P\Big( V \cdot F(Y) + (1-V) \cdot F(Y^-) \leq \tau \Big) = \tau, \quad
\text{for all $\tau \in [0,1]$},
\]
for $V \sim \mathrm{Unif}(0,1)$, independent of $F$ and $Y$, which is typically
how you will see it defined.
\paragraph{Quantile reformulation.}
When $F$ is continuous almost surely, we can reinterpret probabilistic
calibration as follows: $F$ is probabilistically calibrated for $Y$ if
\[
\P(Y \leq F^{-1}(\tau)) = \tau, \quad \text{for all $\tau \in [0,1]$}.
\]
This is highly intuitive; for example, when we inspect the forecaster's
predicted quantile at the level 0.9, the target $Y$ should lie below this 90\%
of the time, and so on, for all probability levels.
\paragraph{Dispersion: over and under.}
If $F$ places too little mass in the tails, then the PIT $F^*(Y)$ will be
U-shaped, and its variance will be large compared to that of a uniform
distribution. Conversely, if $F$ places too much mass in the tails, then the PIT
$F^*(Y)$ will be upside-down U-shaped, and its variance will be comparatively
small.
Recalling that we use $U \sim \mathrm{Unif}(0,1)$ to denote a standard uniform
random variable, this leads us to the following definitions, in the context of
PIT calibration:
\begin{itemize}
\item $F$ is said to be \emph{overdispersed} for $Y$ if $\Var[F^*(Y)] < \Var[U]$;
\item $F$ is said to be \emph{underdispersed} for $Y$ if $\Var[F^*(Y)] >
\Var[U]$.
\end{itemize}
Figures \ref{fig:pit1} and \ref{fig:pit2} provide examples.
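These dispersion patterns are easy to reproduce numerically. The sketch below (Python with NumPy, an illustrative choice) takes the target to be $N(0,1)$ and compares an underdispersed forecaster $N(0, 0.5^2)$ with an overdispersed one $N(0, 2^2)$:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)  # target: N(0, 1)

# Standard normal CDF (vectorized via math.erf).
norm_cdf = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2))))

pit_under = norm_cdf(y / 0.5)  # forecaster N(0, 0.5^2): too little tail mass
pit_over = norm_cdf(y / 2.0)   # forecaster N(0, 2^2): too much tail mass

# A calibrated PIT would have variance Var[U] = 1/12, about 0.083; the
# underdispersed PIT is U-shaped (variance larger), while the overdispersed
# PIT peaks at 1/2 (variance smaller).
print(pit_under.var(), pit_over.var())
```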
\begin{figure}[p]
\centering
\includegraphics[width=0.7\textwidth]{pit1.png}
\vspace{-15pt}
\caption{\it Densities of PIT distributions for several simple normal
forecasters, when the true target distribution is $N(0,1)$. Credit:
\citet{rumack2022recalibrating}.}
\label{fig:pit1}
\bigskip\medskip
\includegraphics[width=0.85\textwidth]{pit2.png}
\caption{\it Densities of PIT distributions from 27 forecasters submitted to the
annual FluSight (seasonal influenza forecasting) challenges held by CDC, over
9 seasons (2010-11 to 2018-19). The PIT densities fall mostly into one of two
categories: overdispersed with a peak around 0.5, and underdispersed with
peaks at 0 and 1. (The outlier with a peak at 0.1 is the PIT density of a
simple baseline forecaster.) Credit: \citet{rumack2022recalibrating}.}
\label{fig:pit2}
\end{figure}
\subsection{Marginal calibration}
A forecaster with predicted CDF $F$ is said to be \emph{marginally calibrated}
for a target $Y$ if
\begin{equation}
\label{eq:marginal_calib}
F^{-1}(U) \deq Y,
\end{equation}
where recall $U \sim \mathrm{Unif}(0,1)$. Another way of expressing this is
\[
\E[F(y)] = \P(Y \leq y), \quad \text{for all $y$},
\]
which is typically how you will see it defined. This also explains its name.
Interestingly, marginal calibration is \emph{not} the same as probabilistic
calibration, and it is neither more general nor less general. There are
nontrivial forecasters that satisfy \eqref{eq:prob_calib} but not
\eqref{eq:marginal_calib}, and vice versa, to be discussed shortly.
\paragraph{Dispersion: over and under.}
When $F$ is random and $U \sim \mathrm{Unif}(0,1)$ is independent of $F$, we can
interpret $F^{-1}(U)$ as follows: first draw $F$, then draw a random variable
distributed according to $F$. Thus if $F$ places too much mass in the tails, the
variance of $F^{-1}(U)$ will be comparatively large, and if $F$ places too
little mass in the tails, the variance of $F^{-1}(U)$ will be comparatively
small.
This leads to the following definitions, in the context of marginal calibration:
\begin{itemize}
\item $F$ is said to be \emph{overdispersed} for $Y$ if $\Var[F^{-1}(U)] > \Var[Y]$;
\item $F$ is said to be \emph{underdispersed} for $Y$ if $\Var[F^{-1}(U)] < \Var[Y]$.
\end{itemize}
\paragraph{PIT versus marginal calibration.}
PIT calibration is a statement about the \emph{joint} distribution of the
forecaster $F$ and target $Y$. Marginal calibration is not; it is only a
statement, as its name suggests, about the \emph{marginal} distributions of $F$
and $Y$. In a sense, it can hence be much simpler to understand and reason about
than PIT calibration.
To give an example, when we think about systematic miscalibration---a forecaster
being overdispersed or underdispersed---we might often think about this in terms
of the forecast distribution being too spread out or too peaked, respectively.
But this intuition is really only justified for marginal calibration, and it is an
incomplete way to think about PIT calibration. This is because the latter is
about how $F$ and $Y$ behave \emph{jointly}. Recall that, in PIT terms,
overdispersion means the PIT is too peaked, and underdispersion means it is too
spread out. Either can arise entirely from the dependence between $F$ and $Y$;
that is, it can happen even when the forecast distribution $F$ is marginally
calibrated---when it has ``just enough spread''.
\paragraph{Examples.}
Following \citet{gneiting2007probabilistic}, suppose $\mu$ is drawn from $N(0,
1)$, and $Y$ is drawn (independently of $\mu$) from $N(\mu, 1)$. Consider (where
we use $F$ to denote both a distribution and a CDF):
\begin{itemize}
\item the \emph{ideal} forecaster: $F = N(\mu, 1)$;
\item the \emph{climatological} forecaster: $F = N(0, 2)$;
\item the \emph{flipped} forecaster: $F = N(-\mu, 1)$;
\item the \emph{unfocused} forecaster: \smash{$F = \frac{1}{2}[ N(\mu, 1) +
N(\mu + \xi, 1)]$}, for $\xi = \pm 1$ with equal probability, independent
of $Y,\mu$.
\end{itemize}
To gain intuition for the setup, you can think of an underlying sequential
problem, where at time $t$, we draw $\mu_t \sim N(0,1)$, then draw $Y_t \sim
N(\mu_t, 1)$ independently, and the forecasters are $F_t = N(\mu_t, 1)$ (ideal),
$F_t = N(0, 2)$ (climatological), and so on.
Direct calculations, as given in Section 2 of \citet{gneiting2007probabilistic},
reveal the following:
\begin{itemize}
\item the ideal forecaster is both probabilistically and marginally calibrated;
\item the climatological forecaster is both probabilistically and marginally
  calibrated;
\item the flipped forecaster is marginally but not probabilistically
calibrated;
\item the unfocused forecaster is probabilistically but not marginally
calibrated.
\end{itemize}
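These claims can be confirmed by direct simulation of the sequential setup above. The following sketch (Python with NumPy, an illustrative choice) checks PIT variances against $\Var[U] = 1/12$, and compares the variance of forecast draws $F^{-1}(U)$ with $\Var[Y] = 2$:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
norm_cdf = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2))))

mu = rng.normal(size=n)
y = mu + rng.normal(size=n)           # Y | mu ~ N(mu, 1); marginally Var[Y] = 2
xi = rng.choice([-1.0, 1.0], size=n)  # for the unfocused forecaster

# PITs (all four forecast CDFs are continuous, so F*(Y) = F(Y)):
pit_ideal = norm_cdf(y - mu)                                # F = N(mu, 1)
pit_clim = norm_cdf(y / math.sqrt(2))                       # F = N(0, 2)
pit_flip = norm_cdf(y + mu)                                 # F = N(-mu, 1)
pit_unf = 0.5 * (norm_cdf(y - mu) + norm_cdf(y - mu - xi))  # mixture CDF

# Ideal, climatological, unfocused: variance near 1/12 (uniform PIT);
# flipped: variance > 1/12, so not probabilistically calibrated.
print(pit_ideal.var(), pit_clim.var(), pit_unf.var(), pit_flip.var())

# Marginal calibration: compare Var[F^{-1}(U)] with Var[Y] = 2.
draws_flip = -mu + rng.normal(size=n)            # variance 2: calibrated
comp = rng.integers(0, 2, size=n)                # mixture component indicator
draws_unf = mu + xi * comp + rng.normal(size=n)  # variance 2.5: not calibrated
print(y.var(), draws_flip.var(), draws_unf.var())
```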
\subsection{Conditional calibration}
When $Y$ is binary, we can think of a forecaster as providing a predicted
probability $p$ of the event $Y = 1$. In this setting, a forecaster $p$ is said
to be \emph{conditionally calibrated} if
\begin{equation}
\label{eq:cond_calib}
\E[Y|p] = p, \quad \text{almost surely}.
\end{equation}
This is highly intuitive; for example, when the forecaster outputs a predicted
probability of 0.2, the event $Y=1$ should materialize 20\% of the time, and so
on, for all probability levels.
Interestingly, and perhaps surprisingly, this is the same as probabilistic
calibration in the binary setting. This is due to \citet{gneiting2013combining}.
\begin{theorem}
Let $Y$ be binary valued, and suppose a forecaster outputs a predicted
probability $p$ of the event $Y = 1$, with associated predicted CDF
\[
F(y) = (1-p) \cdot 1\{y \geq 0\} + p \cdot 1\{y \geq 1\}.
\]
Then $F$ is probabilistically calibrated as in \eqref{eq:prob_calib} if and only
if $p$ is conditionally calibrated as in \eqref{eq:cond_calib}.
\end{theorem}
\begin{proof}
Supposing $p$ is conditionally calibrated, we can write, for $V \sim
\mathrm{Unif}(0,1)$, independent of everything else,
\[
F^*(Y) = V (1-p) \cdot (1-Y) + (1-p + V p) \cdot Y.
\]
Conditional calibration \eqref{eq:cond_calib} says $Y\,|\,p \sim
\mathrm{Bern}(p)$, and fixing any $\tau \in [0,1]$, we can compute
\begin{align*}
\P(F^*(Y) \leq \tau \,|\, p)
&= \P\big( V (1-p) (1-Y) + (1-p + V p) Y \leq \tau \,\big|\, p \big) \\
&= \P\big( V (1-p) \leq \tau \,\big|\, p \big) \cdot (1-p) + \P\big( V p \leq
\tau-(1-p) \,\big|\, p \big) \cdot p.
\end{align*}
Denoting $a = \P(V (1-p) \leq \tau \,|\, p)$ and $b = \P(V p \leq \tau-(1-p)
\,|\, p)$, observe that
\[
a = \frac{\tau}{1-p} \wedge 1,
\quad
b = \frac{\tau-(1-p)}{p} \vee 0.
\]
Thus when $\tau \leq 1-p$, we get $\P(F^*(Y) \leq \tau \,|\, p) = a(1-p) =
\tau$, and when $\tau > 1-p$, we get $\P(F^*(Y) \leq \tau \,|\, p) = (1-p) + bp
= \tau$. Marginalizing over $p$ proves the probabilistic calibration property
\eqref{eq:prob_calib}.
The other direction, that probabilistic calibration implies conditional
calibration, is less elementary and we refer to the proof of Theorem 2.11 in
\citet{gneiting2013combining} for details.
\end{proof}
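The forward direction of the proof is easy to confirm numerically. Below is a sketch (Python with NumPy, illustrative) that draws forecasts $p$ from an arbitrary distribution, generates $Y \,|\, p \sim \mathrm{Bern}(p)$ so that conditional calibration holds by construction, and checks that the randomized PIT is standard uniform:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

p = rng.beta(2.0, 3.0, size=n)               # any forecast distribution works
y = (rng.uniform(size=n) < p).astype(float)  # Y | p ~ Bern(p): calibrated
v = rng.uniform(size=n)

# Randomized PIT from the proof: F*(Y) = V(1-p)(1-Y) + (1-p+Vp)Y.
pit = v * (1 - p) * (1 - y) + (1 - p + v * p) * y

# Standard uniform: mean 1/2, variance 1/12, flat histogram.
print(pit.mean(), pit.var())
hist, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
print(hist / n)  # each bin close to 0.10
```

The Beta(2,3) choice for the law of $p$ is arbitrary; the uniformity of the PIT holds for any forecast distribution, which is the content of the theorem.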
It is also interesting to remark that this equivalence does \emph{not} extend
beyond binary outcomes. For three or more distinct levels of a discrete outcome,
it is no longer true that PIT calibration implies conditional
calibration. \citet{gneiting2022regression} provide a counterexample.
More broadly, it is worth remarking that the generalization of
\eqref{eq:cond_calib} beyond the binary setting is sometimes called
\emph{auto-calibration}, which requires the predicted CDF $F$ to satisfy
\begin{equation}
\label{eq:auto_calib}
Y \,|\, F \deq F^{-1}(U) \,|\, F, \quad \text{almost surely}.
\end{equation}
A draw from $F^{-1}(U) \,|\, F$ is simply a random variable distributed
according to $F$. In other words, property \eqref{eq:auto_calib} says that,
almost surely, the conditional law of $Y$ given $F$ is $F$ itself. You'll
sometimes see this written as $\cF[Y|F] = F$, almost surely.
Auto-calibration \eqref{eq:auto_calib} is stronger than all notions of
calibration we have seen thus far, and it implies both probabilistic calibration
\eqref{eq:prob_calib} and marginal calibration
\eqref{eq:marginal_calib}. However, in general, it is not really possible to
check whether auto-calibration holds in practice. See
\citet{gneiting2022regression} for a discussion of this and related topics.
\section{Probability versus quantile aggregation}
\label{sec:prob_quant}
Model aggregation is a rich and important topic in and of itself, and could
easily fill a lecture of its own. To give broader context, model
aggregation methods---also called ensemble methods---occupy a central place in
machine learning, both in theory and in practice. Seminal work on this topic
arose in the 1990s, with the development of Bayesian model averaging, bagging,
boosting, and stacking.
The machine learning literature has mostly focused on ensembling point
predictions, while ensembling distributions has an even longer tradition
in probabilistic forecasting, dating back to the 1960s. Over the next two
sections, we'll focus on a simple class of aggregation methods, linear ones,
from the perspective of probabilistic forecasting. The current section is
adapted from Section 3 of \citet{fakoor2021flexible}.
For convenience, we'll use the term \emph{average} to refer to a weighted linear
combination where the weights are nonnegative and sum to 1. For each
$j=1,\dots,p$, let $F_j$ be a cumulative distribution function (CDF), assumed
continuous, and let $f_j=F_j'$ be its probability density function; let
\smash{$Q_j = F^{-1}_j$} denote the corresponding quantile function, also
assumed continuous, and let $q_j=Q_j'$ be the quantile density function. A standard
fact that relates these
objects:
\begin{equation}
\label{eq:prob_quant}
q_j(t) = \frac{1}{f_j(Q_j(t))} \quad \text{and} \quad
f_j(x) = \frac{1}{q_j(F_j(x))}.
\end{equation}
The first fact can be checked by differentiating $Q_j(F_j(x)) = x$, applying the
chain rule, and reparametrizing via $t=F_j(x)$. The second follows similarly via
$F_j(Q_j(t)) = t$.
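As a quick numerical sanity check on \eqref{eq:prob_quant}, the sketch below (Python with NumPy, illustrative) uses the Exponential(1) distribution, where $F(x) = 1 - e^{-x}$, $f(x) = e^{-x}$, and $Q(t) = -\log(1-t)$, and compares a finite-difference derivative of $Q$ with $1/f(Q(t))$:

```python
import numpy as np

# Exponential(1): Q(t) = -log(1 - t), f(x) = exp(-x).
Q = lambda t: -np.log(1.0 - t)
f = lambda x: np.exp(-x)

t = np.linspace(0.01, 0.99, 99)
h = 1e-6
q_fd = (Q(t + h) - Q(t - h)) / (2 * h)  # finite-difference q(t) = Q'(t)
q_id = 1.0 / f(Q(t))                    # the identity in the text

print(np.max(np.abs(q_fd / q_id - 1.0)))  # tiny relative error
```

Here both sides work out to $1/(1-t)$ in closed form, so the agreement is exact up to finite-difference error.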
\subsection{Two ways of averaging}
We compare and contrast two ways of averaging distributions. The first way is in
probability space, where we define for weights $w_j \geq 0$, $j=1,\dots,p$ such
that \smash{$\sum_{j=1}^p w_j = 1$},
\[
F = \sum_{j=1}^p w_j F_j.
\]
The associated density is simply \smash{$f = \sum_{j=1}^p w_j f_j$} since
differentiation is a linear operator. The second way to average is in quantile
space, defining
\[
\bar{Q} = \sum_{j=1}^p w_j Q_j,
\]
where now $\bar{q} = \sum_{j=1}^p w_j q_j$ is the associated quantile density,
again by linearity of differentiation. Denote the CDF and probability density
associated with the quantile average by \smash{$\bar{F} = \bar{Q}^{-1}$}, and
\smash{$\bar{f} = \bar{F}'$}. Note that from \eqref{eq:prob_quant}, we can
reason that \smash{$\bar{f}$} is highly \emph{nonlinear} as a function of $f_j$,
$j=1,\dots,p$.
A simple example can go a long way to illustrate the differences between the
distributions resulting from probability and quantile averaging. In Figure
\ref{fig:prob_quant}, we compare these two ways of averaging on a pair of
normal distributions with different means and variances. Here we see that
probability averaging produces the familiar mixture of normals, which is
bimodal. The result of quantile averaging is very different: it is always
unimodal, and instead of interpolating between the tail behaviors of $f_1$ and
$f_2$ (as $f$ does), it appears that \emph{both} tails of \smash{$\bar{f}$} are
generally thinner than those of $f_1$ and $f_2$.
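The figure's comparison can be reproduced without plotting. In the sketch below (Python with NumPy, illustrative; equal weights chosen for concreteness), the probability-average density is the mixture $w f_1 + (1-w) f_2$, while the quantile-average density is evaluated parametrically: writing $t = \Phi(z)$ gives $Q_j(t) = \mu_j + \sigma_j z$, and the harmonic-mean identity derived at the end of this section gives \smash{$\bar{f}$} at \smash{$\bar{Q}(t)$}. Counting local maxima confirms bimodal versus unimodal:

```python
import math
import numpy as np

def npdf(x, m, s):
    # normal density with mean m, standard deviation s
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def count_modes(dens):
    # number of strict interior local maxima on the grid
    return int(np.sum((dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])))

# The two normals from the figure, with equal weights (an illustrative choice).
(m1, s1), (m2, s2), w = (1.0, 0.25), (3.0, 0.5), 0.5

# Probability average: density is the mixture w*f1 + (1-w)*f2.
x = np.linspace(-1.0, 5.0, 2001)
f_mix = w * npdf(x, m1, s1) + (1 - w) * npdf(x, m2, s2)

# Quantile average, parametrized by z with t = Phi(z): Q_j(t) = m_j + s_j z,
# and 1/fbar(Qbar(t)) = sum_j w_j / f_j(Q_j(t)) (weighted harmonic mean).
z = np.linspace(-5.0, 5.0, 2001)
fbar = 1.0 / (w / npdf(m1 + s1 * z, m1, s1)
              + (1 - w) / npdf(m2 + s2 * z, m2, s2))

print(count_modes(f_mix), count_modes(fbar))  # 2 modes versus 1
```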
\begin{figure}[htb]
\centering
\includegraphics[width=0.9\textwidth]{prob_quant.pdf}
\caption{\it Densities that result from a probability average (left) or quantile
average (right) of two normal distributions $N(1,0.25^2)$ and $N(3, 0.5^2)$,
as the weight on the first density varies from 1 (drawn in orange) to 0
(drawn in blue). Credit: \citet{fakoor2021flexible}.}
\label{fig:prob_quant}
\end{figure}
It seems that quantile averaging is doing something that is both like
translation and scaling in probability density space. Next we explain this
phenomenon precisely by recalling a classic result.
\subsection{Shape preservation}
An aggregation procedure $H$ is said to be \emph{shape-preserving} if, for any
location-scale family $\cP$ (such as Normal, t, Laplace, or Cauchy) whose
elements differ only by scale and location parameters, we have
\[
F_j \in \cP, \, j=1,\dots,p \implies F = H(F_1,\dots,F_p) \in \cP.
\]
Probability averaging is clearly not shape-preserving; interestingly, however,
quantile averaging is: if each $F_j$ is a member of the same location-scale family
with a base CDF $L$, then we can write $F_j(x) = L((x-\theta_j)/\sigma_j)$, and
thus $Q_j(t) = \theta_j + \sigma_j L^{-1}(t)$, so \smash{$\bar{Q}$} is still of
the form $\theta + \sigma L^{-1}$ and \smash{$\bar{F}$} is also in the
location-scale family. The next proposition collects this and related results
from the literature, due to \citet{thomas1980appropriate,
genest1992vincentization}.
\begin{proposition}
\label{prop:shape_preservation} \hfill
\begin{itemize}
\item[(i)] Quantile averaging is shape-preserving.
\item[(ii)] Location-scale families $\cP$ are the only ones with respect to
which quantile averaging is a closed operation (meaning $F_j \in \cP$,
$j=1,\dots,p$ implies \smash{$\bar{F} \in \cP$}).
\item[(iii)] Quantile averaging is the only aggregation method $H$, of those
satisfying (for $h$ not depending on $t$):
\[
H(F_1,\dots,F_p)^{-1}(t) = h(Q_1(t),\dots,Q_p(t)),
\]
that is shape-preserving.
\end{itemize}
\end{proposition}
The parts of Proposition \ref{prop:shape_preservation}, taken together, suggest
that quantile averaging is somehow ``tailor-made'' for shape preservation in a
location-scale family---which can be seen as either a pro or a con, depending on
the application one has in mind. To elaborate, suppose that in a quantile
regression ensembling application, each base model outputs a normal distribution
for its predicted distribution at each $x$ (with different means and
variances). If the normal assumption is warranted (i.e., it actually describes
the data generating distribution) then we would want our ensemble to retain
normality, and quantile averaging would do exactly this. But if the normal
assumption is used only as a working model, and we are looking to combine base
predictions as a way to construct some flexible and robust model, then the
shape-preserving property of quantile averaging would be problematic. In
general, to model arbitrary distributions without imposing strong assumptions,
we are therefore driven to use linear combinations of quantiles that allow
\emph{different aggregation weights to be used for different quantile levels},
of the form \smash{$\bar{Q}(t) = \sum_{j=1}^p w_j(t) Q_j(t)$}.
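A closed-form sanity check of shape preservation is easy in a location-scale family whose quantile function is explicit. The sketch below (Python with NumPy, illustrative) uses the logistic family, $Q(t) = \mu + \sigma \log(t/(1-t))$, and verifies that a quantile average of three logistic CDFs is again logistic, with location $\sum_j w_j \mu_j$ and scale $\sum_j w_j \sigma_j$:

```python
import numpy as np

# Logistic location-scale family (closed-form quantiles): Q(t) = m + s*logit(t).
logit = lambda t: np.log(t / (1.0 - t))
Q = lambda t, m, s: m + s * logit(t)

params = [(0.0, 1.0), (2.0, 0.5), (-1.0, 2.0)]  # three logistic forecasts
w = np.array([0.5, 0.3, 0.2])

t = np.linspace(0.001, 0.999, 999)
qbar = sum(wj * Q(t, m, s) for wj, (m, s) in zip(w, params))

# Shape preservation: qbar is itself a logistic quantile function, with
# location sum_j w_j m_j and scale sum_j w_j s_j.
m_bar = sum(wj * m for wj, (m, _) in zip(w, params))
s_bar = sum(wj * s for wj, (_, s) in zip(w, params))
print(np.max(np.abs(qbar - Q(t, m_bar, s_bar))))  # zero up to rounding
```

The same check with $t$-dependent weights $w_j(t)$ would fail, which is exactly the point of the last display: varying weights across quantile levels breaks out of the location-scale family.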
\subsection{Moments and sharpness}
Next, we consider moments of the distributions returned by probability and
quantile averages, recalling a result due to \citet{lichtendahl2013better}. For
a distribution $G$, we denote its uncentered moment of order $k \geq 1$ by
\smash{$m_k(G) = \E_{X \sim G}[X^k]$}.
\begin{proposition}
\label{prop:moments_sharpness} \hfill
\begin{itemize}
\item[(i)] A probability and quantile average always have equal means:
\smash{$m_1(F) = m_1(\bar{F})$}.
\item[(ii)] A quantile average is always sharper than a probability average:
\smash{$m_k(\bar{F}) \leq m_k(F)$} for any even $k \geq 2$.
\end{itemize}
\end{proposition}
Note that sharpness is only a desirable property if it does not come at the
expense of calibration. With this in mind, the above result cannot be understood
as a pro or con of quantile averaging without any context on calibration---this
will be studied in Section \ref{sec:aggr_calib}. That said, the relative
sharpness of quantile averages to probability averages is an important general
phenomenon to be aware of.
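Proposition \ref{prop:moments_sharpness} can be checked by Monte Carlo, using the fact that a draw from the quantile average can be generated as \smash{$\bar{Q}(U) = \sum_j w_j Q_j(U)$} with a single shared uniform $U$, while a draw from the probability average picks a component $j$ with probability $w_j$ and then samples from $F_j$. The sketch below (Python with NumPy, illustrative) uses two exponential distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000
w = 0.5
lam1, lam2 = 1.0, 3.0  # two exponential forecast distributions

u = rng.uniform(size=n)
Q1, Q2 = -np.log(1 - u) / lam1, -np.log(1 - u) / lam2  # quantile functions at u

# Probability average: sample a component, then sample from it (a mixture);
# the component choice is independent of u, so each draw is one Q_j(U).
pick = rng.uniform(size=n) < w
mix = np.where(pick, Q1, Q2)

# Quantile average: a single shared uniform feeds every quantile function.
qavg = w * Q1 + (1 - w) * Q2

print(mix.mean(), qavg.mean())  # equal in expectation: (1/2)(1 + 1/3) = 2/3
print(mix.var(), qavg.var())    # the quantile average is sharper
```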
\subsection{Tail behavior}
Lastly, we study the action of quantile averaging on the tails of the subsequent
probability density. Simply starting from \smash{$\bar{Q} = \sum_{j=1}^p w_j
Q_j$}, differentiating, and using \eqref{eq:prob_quant}, we get
\[
\frac{1}{\bar{f}(\bar{Q}(t))} = \sum_{j=1}^p \frac{w_j}{f_j(Q_j(t))}.
\]
That is, the probability density \smash{$\bar{f}$} at the level $t$ quantile is
a (weighted) \emph{harmonic mean} of the densities $f_j$ at their respective