# Confidence envelopes for ROC curves

A ROC curve $\mathrm{Roc}:[0,1]\to[0,1]$ packs a lot of information about a classifier.
In particular, it allows us to choose a good trade-off between the true positive rate (recall) and the false positive rate, and to compare the performance of different classifiers.
Note that we cannot compute the true ROC curve, as it is determined by the exact data distributions of the positive and negative classes.
Instead, we can only compute an approximation on a test set, and thus the graph is bound to fluctuate.
By computing confidence envelopes, we can characterise this uncertainty.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import numpy.random as rnd

from pandas import Series
from pandas import DataFrame

from tqdm import tnrange
from plotnine import *

# Local imports
from common import *
from convenience import *

## I. Theoretical model behind the ROC curve

First note that the ideal ROC curve is given by the following parametric equation

\begin{align*}
\begin{cases}
\alpha(\tau) = \Pr[\boldsymbol{x}\gets\mathcal{D}_{-}: f(\boldsymbol{x})\geq \tau]\\
\beta(\tau) = \Pr[\boldsymbol{x}\gets\mathcal{D}_{+}: f(\boldsymbol{x})\geq \tau]\enspace,
\end{cases}
\end{align*}

where 
* $\mathcal{D}_{+}$ and $\mathcal{D}_{-}$ are the data distributions of positive and negative cases, respectively
* $f(\boldsymbol{x})$ is the decision value assigned to data point $\boldsymbol{x}$
* $\tau$ is the threshold above which all decision values are declared as positive

From this definition it is easy to see that $\alpha(\tau)$ is the false positive rate and $\beta(\tau)$ is the true positive rate in the limit where the number of test samples is infinite.
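As a small illustration, the finite-sample counterparts of $\alpha(\tau)$ and $\beta(\tau)$ can be computed directly from decision values. The sketch below is only an assumption-laden toy: the function name `empirical_roc_point` and the score values are made up for this example and are not part of the notebook's helper modules.

```python
import numpy as np

def empirical_roc_point(scores_neg, scores_pos, tau):
    """Empirical estimates of alpha(tau) and beta(tau) at a single threshold."""
    scores_neg = np.asarray(scores_neg, dtype=float)
    scores_pos = np.asarray(scores_pos, dtype=float)
    alpha_hat = np.mean(scores_neg >= tau)  # fraction of negatives declared positive
    beta_hat = np.mean(scores_pos >= tau)   # fraction of positives declared positive
    return alpha_hat, beta_hat

# Hypothetical decision values, chosen only for illustration
neg = [0.1, 0.2, 0.35, 0.6]
pos = [0.3, 0.7, 0.8, 0.9]
alpha_hat, beta_hat = empirical_roc_point(neg, pos, tau=0.5)
print(alpha_hat, beta_hat)
```

Sweeping `tau` over all observed decision values would trace out the empirical ROC curve point by point.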

In practice we observe $N$ negative samples and $P$ positive samples in the test set. If all the test samples are iid then  
* all the negative samples are iid samples from $\mathcal{D}_{-}$
* all the positive samples are iid samples from $\mathcal{D}_{+}$

Thus, for a fixed threshold $\tau$, the numbers of false positives $\mathrm{FP}$ and true positives $\mathrm{TP}$ follow binomial distributions:

\begin{align*}
\mathrm{FP}&\sim\mathrm{Bin}(n=N, p=\alpha(\tau))\\
\mathrm{TP}&\sim\mathrm{Bin}(n=P, p=\beta(\tau))
\end{align*}

As the numbers of true and false positives are independent when the numbers of negative and positive cases are fixed, the probability of the pair $(\mathrm{FP},\mathrm{TP})$ is simply the product of the two binomial probabilities.
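One possible way to tabulate this joint distribution is as an outer product of the two binomial probability mass functions. This is a minimal sketch; the function name `joint_pmf` and the parameter values are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def joint_pmf(N, P, alpha, beta):
    """Return an (N+1) x (P+1) array with entry [i, j] = Pr[FP = i, TP = j]."""
    pmf_fp = stats.binom.pmf(np.arange(N + 1), N, alpha)
    pmf_tp = stats.binom.pmf(np.arange(P + 1), P, beta)
    # Independence of FP and TP makes the joint pmf a product of marginals
    return np.outer(pmf_fp, pmf_tp)

# Hypothetical parameters, chosen only for illustration
pmf = joint_pmf(N=10, P=10, alpha=0.2, beta=0.8)
print(pmf.shape, pmf.sum())
```

The resulting table can serve as the common starting point for any test defined on the $(\mathrm{FP},\mathrm{TP})$ grid.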

## II. How to define a statistical test for the pair of ratios

To go further, we must define a single test statistic for which we can compute the probability. This choice determines how narrow the confidence envelope for the ROC curve will be, hence we must be careful. There are three natural ways to define the test statistic.

### The acceptance region will be a square box

Given parameters $N$, $P$, $\alpha$, $\beta$, we can easily compute the probability assigned to the square box centered around the expected values of $FP$ and $TP$:  

\begin{align*}
f(\Delta_1,\Delta_2)= \Pr[|FP-N\alpha|\leq \Delta_1\wedge |TP-P\beta|\leq \Delta_2]
\end{align*}

Choosing the smallest $\Delta_1$ and $\Delta_2$ such that $f(\Delta_1,\Delta_2)\geq 1-\rho$, where $\rho$ is the desired significance level, gives us a statistical test: we reject whenever the observed pair falls outside the box.
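A rough sketch of the box construction follows. Minimising over both half-widths jointly is a two-parameter search, so the code makes a simplifying assumption that is not stated in the text: each marginal interval is required to have coverage at least $\sqrt{1-\rho}$, which by independence guarantees joint coverage at least $1-\rho$. The function names and parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

def min_half_width(n, p, coverage):
    """Smallest integer d with Pr[|X - n*p| <= d] >= coverage for X ~ Bin(n, p)."""
    k = np.arange(n + 1)
    pmf = stats.binom.pmf(k, n, p)
    for d in range(n + 1):
        # Small tolerance guards against floating-point error in n * p
        if pmf[np.abs(k - n * p) <= d + 1e-9].sum() >= coverage:
            return d
    return n

def box_region(N, P, alpha, beta, rho=0.05):
    """Half-widths (Delta_1, Delta_2) under the equal-split assumption."""
    marginal = np.sqrt(1.0 - rho)
    return min_half_width(N, alpha, marginal), min_half_width(P, beta, marginal)

# Hypothetical parameters, chosen only for illustration
d1, d2 = box_region(N=50, P=50, alpha=0.2, beta=0.8, rho=0.05)
print(d1, d2)
```

The equal split is convenient but not necessarily optimal; an unequal split of the coverage between the two axes may give a box with smaller area.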


### The acceptance region will be the cells with maximal probability

As we can compute the probability for any $(FP, TP)$ pair, we can order the pairs by decreasing probability and reject the low-probability tail with total weight at most $\rho$. This will be a slightly more powerful test as we accept the region of the highest probability mass, i.e., the points that are typical for the distribution.
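The ordering idea can be sketched as follows: sort all cells of the joint probability table by decreasing probability and keep the shortest prefix whose mass reaches $1-\rho$. The function name `hpd_region` and the parameter values are assumptions made for illustration, and ties between equally probable cells are broken arbitrarily by the sort.

```python
import numpy as np
from scipy import stats

def hpd_region(N, P, alpha, beta, rho=0.05):
    """Boolean (N+1) x (P+1) mask of accepted (FP, TP) cells, plus the joint pmf."""
    pmf = np.outer(stats.binom.pmf(np.arange(N + 1), N, alpha),
                   stats.binom.pmf(np.arange(P + 1), P, beta))
    order = np.argsort(pmf, axis=None)[::-1]       # cells by decreasing probability
    cum = np.cumsum(pmf.ravel()[order])
    # Shortest prefix of the ordering whose total mass is at least 1 - rho
    n_keep = min(int(np.searchsorted(cum, 1.0 - rho)) + 1, pmf.size)
    mask = np.zeros(pmf.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask.reshape(pmf.shape), pmf

# Hypothetical parameters, chosen only for illustration
mask, pmf = hpd_region(N=20, P=20, alpha=0.2, beta=0.8, rho=0.05)
print(mask.sum(), pmf[mask].sum())
```

Plotting `mask` as an image gives a direct visual comparison with the box-shaped region.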


### The acceptance region will be an ellipse determined by a normal approximation

Note that for moderately large values of $N$ and $P$, the binomial distribution is well approximated by a normal distribution.
As $FP$ and $TP$ are independent, we can approximate $(FP,TP)$ by a bivariate normal distribution whose covariance matrix has non-zero entries only on the diagonal. By centring and proper scaling, we can make sure that $(\gamma_1\cdot (FP-N\alpha),\gamma_2\cdot (TP-P\beta))$ can be approximated with white Gaussian noise $\mathcal{N}(0, I)$, where the scaling factors are the inverse standard deviations $\gamma_1=1/\sqrt{N\alpha(1-\alpha)}$ and $\gamma_2=1/\sqrt{P\beta(1-\beta)}$. For that distribution the best test statistic is the squared distance
$\gamma_1^2\cdot (FP-N\alpha)^2 +\gamma_2^2\cdot (TP-P\beta)^2$, which is approximately distributed according to a $\chi^2$-distribution with two degrees of freedom. This gives rise to another test.
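A minimal sketch of the resulting test is given below. It assumes that the counts are first centred at their expected values $N\alpha$ and $P\beta$ and then scaled by the inverse binomial standard deviations, so that the squared distance can be compared with a $\chi^2$ quantile with two degrees of freedom. The function name `ellipse_accepts` and the parameter values are hypothetical, and the code requires $0<\alpha<1$ and $0<\beta<1$.

```python
import numpy as np
from scipy import stats

def ellipse_accepts(fp, tp, N, P, alpha, beta, rho=0.05):
    """Normal-approximation test: accept if the standardised squared distance
    of (fp, tp) from (N*alpha, P*beta) is below the chi-square quantile."""
    z1 = (fp - N * alpha) / np.sqrt(N * alpha * (1 - alpha))
    z2 = (tp - P * beta) / np.sqrt(P * beta * (1 - beta))
    return z1**2 + z2**2 <= stats.chi2.ppf(1 - rho, df=2)

# Hypothetical parameters, chosen only for illustration
at_mean = ellipse_accepts(100, 800, N=1000, P=1000, alpha=0.1, beta=0.8)
far_off = ellipse_accepts(150, 800, N=1000, P=1000, alpha=0.1, beta=0.8)
print(at_mean, far_off)
```

Because the function uses only elementwise NumPy operations, `fp` and `tp` can also be arrays, which makes it easy to evaluate the acceptance region on the whole grid at once.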

# Homework

## 4.1 Compare tests for the pair of ratios* (<font color='red'>3p</font>)  

Implement all three tests and compare their acceptance regions for the case $N=P$ and $N\in\{10, 50, 100, 1000\}$.
Compare the power of these tests by considering the hypotheses $(\alpha,\beta)$ and $(\alpha+\delta, \beta+\delta)$:
* for some reasonable $\delta$ values in the range $[0, 0.1]$
* for some reasonable $\alpha, \beta$ values in the box $[0,1]\times[0,1]$

Can you decide which test is the best to use in computing confidence intervals for comparing hypotheses $(\alpha,\beta)$ and $(\alpha+\delta_1, \beta+\delta_2)$?

## 4.2 Build a confidence envelope for the ROC curve* (<font color='red'>3p</font>)  

Use the naive grid testing approach to build a confidence envelope:
* For each $(FP,TP)$ pair in the empirical ROC curve, compute the acceptance region for $(\alpha_i,\beta_i)$.
* Merge all the acceptance regions and compute the convex hull of their union.
* Declare the result as the confidence envelope.

Test whether the construction indeed gives a pointwise confidence envelope. For each $(FP,TP)$ pair, the corresponding estimate $(FP/N, TP/P)$ is guaranteed to lie in the envelope with confidence at least $1-\rho$ by construction, but if the individual regions overlap, the actual confidence can be much larger.

In [None]:
%config IPCompleter.greedy=True