# Confidence envelopes for ROC curves

A ROC curve $\mathrm{Roc}:[0,1]\to[0,1]$ packs a lot of information about a classifier.
In particular, it allows you to choose good tradeoff points between the true positive rate (recall) and the false positive rate, or to compare the performance of different classifiers.
Note that we cannot compute the true ROC curve, as it is determined by the exact data distributions of the positive and negative classes.
Instead, we can only compute an approximation on a test set, and thus the graph is bound to fluctuate.
By computing confidence envelopes we can characterise this uncertainty.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import numpy.random as rnd

from pandas import Series
from pandas import DataFrame

from tqdm import tnrange
from plotnine import *

# Local imports
from common import *
from convenience import *

## I. Theoretical model behind ROC curve

First note that the ideal ROC curve is given by the following parametric equation

\begin{align*}
\begin{cases}
\alpha(\tau) = \Pr[\boldsymbol{x}\gets\mathcal{D}_{-}: f(\boldsymbol{x})\geq \tau]\\
\beta(\tau) = \Pr[\boldsymbol{x}\gets\mathcal{D}_{+}: f(\boldsymbol{x})\geq \tau]
\end{cases}
\end{align*}

where 
* $\mathcal{D}_{+}$ and $\mathcal{D}_{-}$ are data distributions of positive and negative cases, 
* $f(\boldsymbol{x})$ is the decision value assigned to data point $\boldsymbol{x}$,
* $\tau$ is the threshold at or above which all decision values are declared as positive.

From this definition it is easy to see that $\alpha(\tau)$ is the false positive rate and $\beta(\tau)$ is the true positive rate in a setting where the number of test samples is infinite.
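On a finite test set, the two probabilities in the parametric equation can only be estimated by the fraction of negative and positive samples whose decision value is at least $\tau$. A minimal sketch, assuming the decision values are already available as plain lists; the function names `empirical_rates` and `empirical_roc` and the sample scores are hypothetical and chosen only for illustration.

```python
def empirical_rates(neg_scores, pos_scores, tau):
    """Estimate (alpha(tau), beta(tau)) from finite samples."""
    # Count the samples whose decision value is at or above the threshold
    fp = sum(s >= tau for s in neg_scores)  # negatives declared positive
    tp = sum(s >= tau for s in pos_scores)  # positives declared positive
    return fp / len(neg_scores), tp / len(pos_scores)


def empirical_roc(neg_scores, pos_scores):
    """Evaluate the estimates at every observed decision value."""
    thresholds = sorted(set(neg_scores) | set(pos_scores), reverse=True)
    return [empirical_rates(neg_scores, pos_scores, t) for t in thresholds]


# Hypothetical decision values, not taken from any real classifier
neg_scores = [0.1, 0.2, 0.35, 0.4, 0.8]
pos_scores = [0.3, 0.6, 0.7, 0.9]

alpha_hat, beta_hat = empirical_rates(neg_scores, pos_scores, tau=0.5)
roc = empirical_roc(neg_scores, pos_scores)
print(alpha_hat, beta_hat)
```

Each element of `roc` is one point of the empirical curve; a different test set would give different points, which is the fluctuation the confidence envelope is meant to capture.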

In practice we observe $N$ negative samples and $P$ positive samples in the test set. If all test samples are taken as iid samples then  
* all negative samples are iid samples from $\mathcal{D}_{-}$
* all positive samples are iid samples from $\mathcal{D}_{+}$

Thus the number of false positives $\mathrm{FP}$ and the number of true positives $\mathrm{TP}$ follow binomial distributions:

\begin{align*}
\mathrm{FP}&\sim\mathrm{Bin}(n=N, p=\alpha(\tau))\\
\mathrm{TP}&\sim\mathrm{Bin}(n=P, p=\beta(\tau))
\end{align*}

From this it is straightforward to assign a probability to the pair $(\mathrm{FP},\mathrm{TP})$: the numbers of true positives and false positives are independent once the numbers of negative and positive cases are fixed, so the joint probability is the product of the two binomial probabilities.
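The probability of a pair can be sketched directly from the two binomial distributions. The block below uses only the standard library so that it runs on its own; the helper names `binom_pmf` and `joint_pmf` and the parameter values are hypothetical, and `scipy.stats.binom.pmf` could be used in place of the hand-written helper.

```python
import math


def binom_pmf(k, n, p):
    """Pr[X = k] for X ~ Bin(n, p), computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def joint_pmf(fp, tp, N, P, alpha, beta):
    """Pr[FP = fp, TP = tp]; independence lets the two factors multiply."""
    return binom_pmf(fp, N, alpha) * binom_pmf(tp, P, beta)


# Hypothetical parameter values, chosen only for illustration
N, P, alpha, beta = 20, 30, 0.1, 0.8

# Sanity check: the probabilities of all (FP, TP) cells sum to one
total = sum(joint_pmf(fp, tp, N, P, alpha, beta)
            for fp in range(N + 1) for tp in range(P + 1))
print(joint_pmf(2, 24, N, P, alpha, beta), total)
```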


## II. How to define a statistical test for the pair of ratios

In order to get further we must define a single test statistic for which we can compute the probability.
This is a hard question, as the answer determines how narrow the final confidence envelope for the ROC curve will be. There are three natural ways to define the test.

### The acceptance region will be a rectangular box

Given parameters $N$, $P$, $\alpha$, $\beta$, we can easily compute the probability assigned to a rectangular box centered around the expected values of $FP$ and $TP$:

\begin{align*}
f(\Delta_1,\Delta_2)= \Pr[|FP-N\alpha|\leq \Delta_1\wedge |TP-P\beta|\leq \Delta_2]
\end{align*}

and minimise $\Delta_1$ and $\Delta_2$ so that $f(\Delta_1,\Delta_2)\geq 1-\rho$, where $\rho$ is the desired significance level. This gives us a statistical test.
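The minimisation over two half-widths does not have a unique solution, so any implementation has to pick a rule. A minimal sketch, assuming that only integer half-widths are searched and that the confidence is split evenly, so each marginal interval must cover at least $\sqrt{1-\rho}$; because $FP$ and $TP$ are independent, the box probability is the product of the two marginal coverages. The names `binom_pmf`, `marginal_halfwidth` and `box_region` and the parameter values are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Pr[X = k] for X ~ Bin(n, p) with 0 < p < 1, via log-gamma."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def marginal_halfwidth(n, p, level):
    """Smallest integer d with Pr[|X - n*p| <= d] >= level, plus that mass."""
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    for d in range(n + 1):
        mass = sum(q for k, q in enumerate(pmf) if abs(k - n * p) <= d)
        if mass >= level:
            return d, mass
    return n, 1.0  # the full range always suffices


def box_region(N, P, alpha, beta, rho):
    """Half-widths (d1, d2) and the coverage of the resulting box."""
    # Assumption: each marginal must cover sqrt(1 - rho), so the product
    # of the two independent coverages is at least 1 - rho.
    level = math.sqrt(1.0 - rho)
    d1, mass1 = marginal_halfwidth(N, alpha, level)
    d2, mass2 = marginal_halfwidth(P, beta, level)
    return d1, d2, mass1 * mass2


d1, d2, coverage = box_region(N=100, P=100, alpha=0.1, beta=0.8, rho=0.05)
print(d1, d2, coverage)
```

The even split is only one possible choice: a box with a smaller area may exist when one half-width is shrunk and the other enlarged, so this sketch gives a valid but not necessarily minimal acceptance region.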


### The acceptance region will be the cells with maximal probability

As we can compute the probability of any $(FP, TP)$ pair, we can order the pairs by probability and reject the tail with total weight $\rho$. This will be a slightly more powerful test, as we accept the region of highest probability mass, i.e., points that are typical for the distribution.
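The ordering argument can be sketched as a greedy accumulation: sort all cells by probability and keep adding the most probable ones until the accepted mass reaches $1-\rho$. The names `binom_pmf` and `highest_mass_region` and the parameter values are hypothetical, and the sketch enumerates all $(N+1)(P+1)$ cells, which is only practical for moderate sample sizes.

```python
import math


def binom_pmf(k, n, p):
    """Pr[X = k] for X ~ Bin(n, p) with 0 < p < 1, via log-gamma."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def highest_mass_region(N, P, alpha, beta, rho):
    """Set of (fp, tp) cells with the highest probability and their total mass."""
    cells = [(binom_pmf(fp, N, alpha) * binom_pmf(tp, P, beta), fp, tp)
             for fp in range(N + 1) for tp in range(P + 1)]
    cells.sort(reverse=True)  # most probable cells first
    region, mass = set(), 0.0
    for prob, fp, tp in cells:
        if mass >= 1.0 - rho:
            break  # the remaining cells form the rejected tail
        region.add((fp, tp))
        mass += prob
    return region, mass


# Hypothetical parameter values, chosen only for illustration
region, mass = highest_mass_region(N=50, P=50, alpha=0.1, beta=0.8, rho=0.05)
print(len(region), mass)
```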


### The acceptance region will be an ellipse determined by the normal approximation

Note that for moderately large values of $N$ and $P$ the binomial distribution is well approximated by a normal distribution.
As $FP$ and $TP$ are independent, we can approximate $(FP,TP)$ by a bivariate normal distribution whose covariance matrix has nonzero entries only on the diagonal. By centering and proper scaling we can make sure that $(\gamma_1\cdot(FP-N\alpha),\gamma_2\cdot(TP-P\beta))$ can be approximated with white Gaussian noise $\mathcal{N}(0, I)$; the scaling factors are the inverse standard deviations $\gamma_1=1/\sqrt{N\alpha(1-\alpha)}$ and $\gamma_2=1/\sqrt{P\beta(1-\beta)}$. For that distribution, the best test statistic is the squared distance
$\gamma_1^2\cdot(FP-N\alpha)^2+\gamma_2^2\cdot(TP-P\beta)^2$, which is distributed according to the $\chi^2$-distribution with two degrees of freedom. This gives rise to another test.
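A minimal sketch of the resulting test, assuming that the counts are centred at their means $N\alpha$ and $P\beta$ and scaled by the inverse binomial standard deviations before the squared distance is taken. For two degrees of freedom the $\chi^2$ distribution function is $1-e^{-x/2}$, so the $(1-\rho)$-quantile is $-2\ln\rho$ and no statistics library is needed; `scipy.stats.chi2.ppf(1 - rho, df=2)` should return the same value. The function names and parameter values are hypothetical.

```python
import math


def chi2_quantile_2dof(rho):
    """(1 - rho)-quantile of the chi-squared distribution with 2 dof."""
    # The CDF for 2 degrees of freedom is 1 - exp(-x / 2)
    return -2.0 * math.log(rho)


def in_ellipse(fp, tp, N, P, alpha, beta, rho):
    """Accept (fp, tp) if its scaled squared distance is below the quantile."""
    # Scaling factors: inverse standard deviations of the two binomials
    g1 = 1.0 / math.sqrt(N * alpha * (1.0 - alpha))
    g2 = 1.0 / math.sqrt(P * beta * (1.0 - beta))
    dist_sq = (g1 * (fp - N * alpha)) ** 2 + (g2 * (tp - P * beta)) ** 2
    return dist_sq <= chi2_quantile_2dof(rho)


# Hypothetical parameter values, chosen only for illustration
N, P, alpha, beta, rho = 100, 100, 0.1, 0.8, 0.05
print(in_ellipse(10, 80, N, P, alpha, beta, rho))  # centre of the ellipse
print(in_ellipse(30, 80, N, P, alpha, beta, rho))  # far out along the FP axis
```

The approximation degrades when $N\alpha$ or $P\beta$ is close to $0$ or to the sample size, because the binomial distribution is then strongly skewed.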






# Homework

## 4.1 Compare tests for the pair of ratios* (<font color='red'>3p</font>)  

Implement all three tests and compare their acceptance regions for the case $N=P$ and $N\in\{10, 50, 100, 1000\}$.
Compare the power of these tests by considering hypotheses $(\alpha,\beta)$ and $(\alpha+\delta, \beta+\delta)$
* for some reasonable $\delta$ value in the range $[0, 0.1]$
* for some reasonable $\alpha, \beta$ values in the box $[0,1]\times[0,1]$.   

Can you decide which test is the best to use in computing confidence intervals, as there we are going to compare hypotheses $(\alpha,\beta)$ and $(\alpha+\delta_1, \beta+\delta_2)$?


## 4.2 Build a confidence envelope for the ROC curve* (<font color='red'>3p</font>)  

Use the naive grid testing approach to build a confidence envelope:
* For each $(FP,TP)$ pair in the empirical ROC curve compute acceptance region for $(\alpha_i,\beta_i)$
* Merge all acceptance regions and compute convex hull around it 
* Declare the result as confidence envelope.

Test whether the construction indeed gives a pointwise confidence envelope. For each $(FP,TP)$ pair the corresponding estimate $(FP/N,TP/P)$ is guaranteed to be in the envelope with confidence at least $1-\rho$ by construction, but if individual regions overlap the confidence can be much larger.

