# Relative performance evaluation

<img src="../illustrations/relative_accuracy.png" width=50%  />


In most cases the test set is small, and thus the test error can fluctuate so much that we cannot detect whether one model is better than the other.
In that case there is a clever trick for comparing the relative performance of different models.
It estimates the difference between the models' performance measures directly and is much more precise than the difference of absolute errors, provided that we have a large amount of unlabelled data.
On the flip side, it requires manual evaluation (labelling) of unseen data samples and thus cannot be automated.


## Set of differences

Let $f_1$ and $f_2$ be classifiers which we want to compare and let $R$ be the risk (performance measure) which can be expressed as an expected loss over a random data point

\begin{align*}
R(f)=\mathbf{E}(L(y, f(\boldsymbol{x})))\enspace.
\end{align*}

Then the empirical risk (an estimate of the performance measure) is the average loss over $N$ labelled samples

\begin{align*}
R_N(f)=\frac{1}{N}\cdot \sum_{i=1}^{N} L(y_i, f(\boldsymbol{x}_i))\enspace.
\end{align*}

Consequently, we can estimate the risk difference

\begin{align*}
\Delta R = R(f_1)-R(f_2)
\end{align*}

through the differences of empirical risks


\begin{align*}
\Delta R_N = \frac{1}{N}\cdot \sum_{i=1}^{N} \bigl[L(y_i, f_1(\boldsymbol{x}_i))-L(y_i, f_2(\boldsymbol{x}_i))\bigr]\enspace.
\end{align*}
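
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the 0-1 loss (the text does not fix a particular loss), and the names `zero_one_loss`, `empirical_risk`, `empirical_risk_difference` and the toy data are made up for the example.

```python
def zero_one_loss(y, y_pred):
    """0-1 loss (an assumption): 1 for a misclassification, 0 otherwise."""
    return int(y != y_pred)


def empirical_risk(labels, preds, loss=zero_one_loss):
    """R_N(f): average loss of one classifier over the labelled sample."""
    return sum(loss(y, p) for y, p in zip(labels, preds)) / len(labels)


def empirical_risk_difference(labels, preds1, preds2, loss=zero_one_loss):
    """Delta R_N: average of the per-sample loss differences."""
    diffs = [loss(y, p1) - loss(y, p2)
             for y, p1, p2 in zip(labels, preds1, preds2)]
    return sum(diffs) / len(labels)


# Made-up toy data: true labels and the predictions of two classifiers.
labels = [0, 1, 1, 0, 1, 0, 1, 1]
preds1 = [0, 1, 0, 0, 1, 1, 1, 1]
preds2 = [0, 1, 1, 0, 1, 1, 0, 0]

risk1 = empirical_risk(labels, preds1)
risk2 = empirical_risk(labels, preds2)
delta_r_n = empirical_risk_difference(labels, preds1, preds2)
print(risk1, risk2, delta_r_n)
```

Because averaging is linear, `delta_r_n` coincides with `risk1 - risk2`; the per-sample form is the one that makes the next observation visible.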

Now it is evident that when $f_1(\boldsymbol{x}_i) = f_2(\boldsymbol{x}_i)$ the corresponding losses are the same and the term vanishes. Thus we need labels $y_i$ only for the set of differences

\begin{align*}
\mathcal{I}_\Delta=\{ i: f_1(\boldsymbol{x}_i)\neq f_2(\boldsymbol{x}_i)\}\enspace.
\end{align*}

In particular note that


\begin{align*}
\Delta R_N &= \frac{1}{N}\cdot \sum_{i\in \mathcal{I}_\Delta} \bigl[L(y_i, f_1(\boldsymbol{x}_i))-L(y_i, f_2(\boldsymbol{x}_i))\bigr]\\
&= \frac{|\mathcal{I}_\Delta|}{N}\cdot\frac{1}{|\mathcal{I}_\Delta|}\cdot \sum_{i\in \mathcal{I}_\Delta} \bigl[L(y_i, f_1(\boldsymbol{x}_i))-L(y_i, f_2(\boldsymbol{x}_i))\bigr]\enspace.
\end{align*}
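
A small sketch of this observation: the set of differences is found from predictions alone, and labels are requested only for its members. The 0-1 loss, the helper names and the toy predictions are assumptions made for illustration.

```python
def zero_one_loss(y, y_pred):
    """0-1 loss (an assumption): 1 for a misclassification, 0 otherwise."""
    return int(y != y_pred)


def set_of_differences(preds1, preds2):
    """Indices where the two classifiers disagree (the set I_Delta)."""
    return [i for i, (p1, p2) in enumerate(zip(preds1, preds2)) if p1 != p2]


def risk_difference_from_disagreements(labels_on_diff, preds1, preds2,
                                       loss=zero_one_loss):
    """Delta R_N from labels on I_Delta only.

    labels_on_diff maps index -> label and must cover the whole set I_Delta;
    all other samples contribute zero to the sum and need no label.
    """
    n_total = len(preds1)
    total = sum(loss(y, preds1[i]) - loss(y, preds2[i])
                for i, y in labels_on_diff.items())
    return total / n_total


# Made-up predictions of two classifiers on N = 8 unlabelled samples.
preds1 = [0, 1, 0, 0, 1, 1, 1, 1]
preds2 = [0, 1, 1, 0, 1, 1, 0, 0]

diff_idx = set_of_differences(preds1, preds2)  # computed without any labels
labels_on_diff = {2: 1, 6: 1, 7: 1}            # labels obtained only for I_Delta
delta_r_n = risk_difference_from_disagreements(labels_on_diff, preds1, preds2)
print(diff_idx, delta_r_n)
```

Here only three of the eight samples had to be labelled to obtain the exact value of the empirical risk difference.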

## Performance on the set of differences

Let us now define the performance on the set of differences

\begin{align*}
\Delta R_{\mathcal{I}_\Delta} = \frac{1}{|\mathcal{I}_\Delta|}\cdot \sum_{i\in \mathcal{I}_\Delta} \bigl[L(y_i, f_1(\boldsymbol{x}_i))-L(y_i, f_2(\boldsymbol{x}_i))\bigr]
\end{align*}

and the relative size of the set of differences

\begin{align*}
\hat{p}_\Delta=\frac{|\mathcal{I}_\Delta|}{N}\enspace.
\end{align*}

Then the empirical risk difference can be expressed as

\begin{align*}
\Delta R_N= \hat{p}_\Delta\cdot \Delta R_{\mathcal{I}_\Delta}\enspace,
\end{align*}

which implies

\begin{align*}
|\Delta R_N| \leq \hat{p}_\Delta\cdot \max |L(y, f_1(\boldsymbol{x}))- L(y, f_2(\boldsymbol{x}))|\enspace.
\end{align*}
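
As a quick illustration of the bound, assume the 0-1 loss (an assumption; the text does not fix a particular loss), so that each loss is either 0 or 1 and the loss difference is at most 1 in absolute value. The bound then reads

\begin{align*}
|\Delta R_N| \leq \hat{p}_\Delta\cdot 1 = \hat{p}_\Delta\enspace.
\end{align*}

For instance, if two classifiers disagree on 20 out of $N=1000$ samples, then $\hat{p}_\Delta=0.02$ and their empirical error rates can differ by at most two percentage points, whatever the labels are.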


By labelling a small random subset $\mathcal{I}$ of the set of differences $\mathcal{I}_\Delta$ we can estimate $\Delta R_{\mathcal{I}_\Delta}$ without looking through the entire set of differences:

\begin{align*}
\Delta R_{\mathcal{I}_\Delta} \approx \widehat{\Delta R}_{\mathcal{I}_\Delta} = \frac{1}{|\mathcal{I}|}\cdot \sum_{i\in \mathcal{I}} \bigl[L(y_i, f_1(\boldsymbol{x}_i))-L(y_i, f_2(\boldsymbol{x}_i))\bigr]\enspace.
\end{align*}
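
The subsampling step might look as follows. This is a sketch under stated assumptions: the 0-1 loss, a hypothetical callback `ask_label(i)` that stands in for the manual labelling of sample `i`, and synthetic data in which the true labels are known so that the result can be checked.

```python
import random


def zero_one_loss(y, y_pred):
    """0-1 loss (an assumption): 1 for a misclassification, 0 otherwise."""
    return int(y != y_pred)


def estimate_on_differences(preds1, preds2, ask_label, n_labels, rng,
                            loss=zero_one_loss):
    """Estimate p_Delta and Delta R on I_Delta from a random labelled subset.

    ask_label is a hypothetical stand-in for a human annotator.
    """
    diff_idx = [i for i, (p1, p2) in enumerate(zip(preds1, preds2)) if p1 != p2]
    if not diff_idx:
        raise ValueError("the classifiers never disagree; nothing to estimate")
    p_hat = len(diff_idx) / len(preds1)
    # Label a uniformly random subset I of the set of differences.
    subset = rng.sample(diff_idx, min(n_labels, len(diff_idx)))
    loss_diffs = []
    for i in subset:
        y = ask_label(i)  # one manual label per sampled index
        loss_diffs.append(loss(y, preds1[i]) - loss(y, preds2[i]))
    dr_hat = sum(loss_diffs) / len(loss_diffs)
    return p_hat, dr_hat, loss_diffs


# Synthetic data: classifier 1 errs on 50 samples, classifier 2 on 100 others.
n = 1000
truth = [i % 2 for i in range(n)]
preds1 = [1 - y if i % 20 == 0 else y for i, y in enumerate(truth)]
preds2 = [1 - y if i % 10 == 5 else y for i, y in enumerate(truth)]

p_hat, dr_hat, loss_diffs = estimate_on_differences(
    preds1, preds2, lambda i: truth[i], n_labels=60, rng=random.Random(0))
print(p_hat, dr_hat, p_hat * dr_hat)
```

In this synthetic setup the true empirical risk difference is $0.05-0.10=-0.05$, and the product `p_hat * dr_hat` approximates it from 60 labels instead of 1000.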


## Variance estimation through moment matching


Note that $\Delta R_{\mathcal{I}_\Delta}$ is a population average whereas $\widehat{\Delta R}_{\mathcal{I}_\Delta}$ is a sample average, and thus we can use standard techniques for estimating their difference.

However, note that the true risk difference $\Delta R$ has a decomposition similar to that of the difference between empirical risks:

\begin{align*}
\Delta R &= \mathbf{E}(L(y, f_1(\boldsymbol{x}))-L(y, f_2(\boldsymbol{x})))\\
&= \Pr[f_1(\boldsymbol{x})\neq f_2(\boldsymbol{x})]\cdot\mathbf{E}(L(y, f_1(\boldsymbol{x}))-L(y, f_2(\boldsymbol{x}))\mid f_1(\boldsymbol{x})\neq f_2(\boldsymbol{x}))\enspace.
\end{align*}

In particular, note that by the law of large numbers

\begin{align*}
\hat{p}_\Delta&\to \Pr[f_1(\boldsymbol{x})\neq f_2(\boldsymbol{x})]=:p_\Delta\\
\widehat{\Delta R}_{\mathcal{I}_\Delta} &\to \mathbf{E}(L(y, f_1(\boldsymbol{x}))-L(y, f_2(\boldsymbol{x}))\mid f_1(\boldsymbol{x})\neq f_2(\boldsymbol{x}))
\end{align*}

and we can use moment matching to estimate the speed of convergence

\begin{align*}
\mathbf{Var}(\hat{p}_\Delta)&\approx \frac{\hat{p}_\Delta(1-\hat{p}_\Delta)}{N}\\
\mathbf{Var}(\widehat{\Delta R}_{\mathcal{I_\Delta}})&\approx \frac{EmpVar}{|\mathcal{I}|} 
\end{align*}

where $EmpVar$ is the empirical variance of the terms $L(y_i, f_1(\boldsymbol{x}_i))- L(y_i, f_2(\boldsymbol{x}_i))$ for $i\in \mathcal{I}$.
Note that the variance is still proportional to $\frac{1}{|\mathcal{I}|}$.
However, the estimate of $\Delta R$ is obtained by multiplying $\widehat{\Delta R}_{\mathcal{I}_\Delta}$ by the rescaling factor $p_\Delta$, which increases the absolute precision when $p_\Delta$ is small, that is, when the difference between the new and the old model is small.
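
The moment-matching estimates can be turned into a rough standard error for the product $\hat{p}_\Delta\cdot\widehat{\Delta R}_{\mathcal{I}_\Delta}$. The sketch below rests on two assumptions that are not stated in the text: the two factors are treated as independent, and their product is linearised with the first-order delta method, $\mathbf{Var}(XY)\approx \mathbf{E}(Y)^2\,\mathbf{Var}(X)+\mathbf{E}(X)^2\,\mathbf{Var}(Y)$. The function names and the numbers are illustrative.

```python
import math


def moment_matching_variances(loss_diffs, n_disagree, n_total):
    """Plug-in variance estimates for p_hat and for the mean loss difference."""
    if len(loss_diffs) < 2:
        raise ValueError("need at least two labelled samples")
    p_hat = n_disagree / n_total
    var_p = p_hat * (1.0 - p_hat) / n_total            # binomial proportion
    m = len(loss_diffs)
    dr_hat = sum(loss_diffs) / m
    emp_var = sum((d - dr_hat) ** 2 for d in loss_diffs) / (m - 1)
    var_dr = emp_var / m                               # variance of a sample mean
    return p_hat, var_p, dr_hat, var_dr


def risk_difference_with_se(loss_diffs, n_disagree, n_total):
    """Estimate of Delta R and an approximate standard error.

    Assumes independence of the two factors and a first-order delta-method
    approximation; both are assumptions of this sketch.
    """
    p_hat, var_p, dr_hat, var_dr = moment_matching_variances(
        loss_diffs, n_disagree, n_total)
    estimate = p_hat * dr_hat
    variance = p_hat ** 2 * var_dr + dr_hat ** 2 * var_p
    return estimate, math.sqrt(variance)


# Illustrative numbers: 150 disagreements among 1000 samples, 60 of them
# labelled; classifier 1 was wrong on 20 and classifier 2 on 40 of those.
loss_diffs = [1] * 20 + [-1] * 40
estimate, std_err = risk_difference_with_se(loss_diffs, n_disagree=150, n_total=1000)

# For comparison: standard error of the mean difference on I_Delta alone.
_, _, _, var_dr = moment_matching_variances(loss_diffs, 150, 1000)
print(estimate, std_err, math.sqrt(var_dr))
```

The standard error of the rescaled estimate is several times smaller than that of $\widehat{\Delta R}_{\mathcal{I}_\Delta}$ itself, which is the gain in absolute precision described above.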
