# Average Precision

## Introduction

Let's assume you have a few binary classifiers that try to distinguish a 'positive class' (called 'signal' here) from a 'negative class' (called 'background' here).

Your aim:
* choose the best classifier

* choose a 'cut' or 'threshold' on the best classifier's score to minimize misclassification rate

We briefly introduce two standard graphical methods, the 'receiver operating characteristic (ROC) curve' and the 'precision-recall (PR) curve', together with their associated metrics for comparing classifiers. We discuss which method is better, and how to choose the best threshold for a classifier on a given dataset.
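As a rough illustration of the second aim, the sketch below scans candidate thresholds and keeps the one with the lowest misclassification rate. The toy Gaussian scores, the grid of 401 candidates, and the helper name `misclassification_rate` are all assumptions made for this example and are not part of the notebook's own code.

```python
import numpy as np

# Hypothetical toy data: Gaussian scores for signal (+1) and background (-1)
rng = np.random.default_rng(0)
sig_scores = rng.normal(1.0, 1.0, 2000)
bkg_scores = rng.normal(-1.0, 1.0, 2000)

scores = np.concatenate((bkg_scores, sig_scores))
labels = np.concatenate((-np.ones(bkg_scores.size), np.ones(sig_scores.size)))


def misclassification_rate(threshold):
    """Fraction of examples that fall on the wrong side of the threshold."""
    predicted = np.where(scores >= threshold, 1, -1)
    return np.mean(predicted != labels)


# Scan a grid of candidate thresholds and keep the one with the lowest error
candidates = np.linspace(scores.min(), scores.max(), 401)
errors = np.array([misclassification_rate(t) for t in candidates])
best_threshold = candidates[np.argmin(errors)]
best_error = errors.min()

print("best threshold: %.2f, misclassification rate: %.3f" % (best_threshold, best_error))
```

For balanced classes with equal-width Gaussian scores the optimum should lie near the midpoint of the two means. With imbalanced classes it shifts towards the less populated class.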

In [32]:
%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from ipywidgets import interact, FloatSlider

The function below generates the classification scores of one class, drawn from a Gaussian distribution.

In [33]:
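def generate_scores(n, mu, sigma):
    """Return n Gaussian-distributed scores with mean mu and standard deviation sigma."""
    # The seed is reset on every call so that the plots are reproducible;
    # as a side effect, every call starts from the same underlying random draws.
    np.random.seed(0)
    X = np.random.normal(mu, sigma, int(n))
    return X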

## Definitions

For a given threshold: 
<div class="alert alert-block alert-danger">
(TODO: add an interactive sliding threshold for two distributions)
</div>
    
* Precision = $\frac{TP}{TP+FP}$ = $\frac{positives\ correctly\ classified}{total\ data\ points\ classified\ as\ positives}$

* True positive rate ($TPR$) or recall = $\frac{TP}{TP+FN}$ = $\frac{positives\ correctly\ classified}{total\ positives}$ 

* False negative rate ($FNR$) = $\frac{FN}{FN+TP}$ = $\frac{positives\ incorrectly\ classified\ as\ negatives}{total\ positives}$ = $1 - TPR$

* True negative rate ($TNR$) = $\frac{TN}{FP+TN}$ = $\frac{negatives\ correctly\ classified}{total\ negatives}$

* False positive rate ($FPR$) = $\frac{FP}{FP+TN}$ = $\frac{negatives\ incorrectly\ classified\ as\ positives}{total\ negatives}$ = $1 - TNR$


In all the above metrics except precision, the denominator is the total count of a single true class (the same class that is counted in the numerator), so the last four metrics can be defined on the probability distributions of the scores of the positive and negative classes. Precision mixes counts from both classes and therefore depends on their relative yields.
<div class="alert alert-block alert-danger">
(TODO: to explain the lines properly when the interactive sliding threshold plot is added)
</div>
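The definitions above can be checked numerically with a small sketch. The Gaussian toy scores, the class sizes, and the threshold of 0 are arbitrary assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical toy scores: 1000 signal and 5000 background examples
rng = np.random.default_rng(0)
sig_scores = rng.normal(1.0, 1.0, 1000)   # positive class
bkg_scores = rng.normal(-1.0, 1.0, 5000)  # negative class
threshold = 0.0                           # arbitrary cut, for illustration

# An example is classified as positive when its score passes the threshold
TP = np.sum(sig_scores >= threshold)
FN = np.sum(sig_scores < threshold)
FP = np.sum(bkg_scores >= threshold)
TN = np.sum(bkg_scores < threshold)

precision = TP / (TP + FP)
tpr = TP / (TP + FN)  # recall
fnr = FN / (FN + TP)
tnr = TN / (FP + TN)
fpr = FP / (FP + TN)

print("precision=%.3f TPR=%.3f FNR=%.3f TNR=%.3f FPR=%.3f"
      % (precision, tpr, fnr, tnr, fpr))
```

Changing the number of background examples in this sketch changes the precision, while the four rates only fluctuate statistically, since each of them is normalized to a single class.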

## Receiver operating characteristic (ROC) curve

<div class="alert alert-block alert-danger">
(TODO: show positive and negative pdfs with two different overlaps (two different classifiers)  and a sliding threshold which shows the corresponding point on the ROC curve)
</div>

The ROC curve plots a rate defined on the positive examples (i.e. TPR or FNR) against a rate defined on the negative examples (i.e. TNR or FPR) as the threshold is varied. The most common convention, used in the figure below, is TPR vs. FPR.

<div class="alert alert-block alert-danger">
(TODO: For now showing the wikipedia fig, replace with interactive version)
</div>

 ![title](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/ROC_curves.svg/800px-ROC_curves.svg.png)


In the above definition of the ROC curve, when there is no overlap between the probability distributions of the scores of the positive and negative classes, the ROC curve touches the (TPR=100%, FPR=0%) point and has an area under the curve (AUC) of 1.

As the overlap between the probability distributions of the scores of the positive and negative classes increases, the area under the curve becomes smaller than 1.

Therefore the area under the ROC curve (AUC) can be used as a metric to compare classifiers. In this case, the greater the AUC, the better the classifier.

For other definitions of the ROC curve, such as $FPR$ vs. $FNR$, the smaller the AUC, the better the classifier.

The AUC in the above figure is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example, i.e. it measures how well the positive examples are ranked above the negative ones when all examples are sorted by score (assuming the classifier gives a higher score to the positive class). This holds because the AUC can be interpreted as the [normalized Mann-Whitney U statistic](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Area-under-curve_(AUC)_statistic_for_ROC_curves) between the score distributions of the positive and negative classes.
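The Mann-Whitney interpretation can be checked numerically. The sketch below counts, over all (positive, negative) pairs, how often the positive example has the higher score, and compares the normalized count with `metrics.roc_auc_score`. The toy Gaussian scores and sample sizes are assumptions made for the example.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy scores for the two classes
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 500)
neg = rng.normal(-1.0, 1.0, 800)

# Mann-Whitney U: number of (positive, negative) pairs in which the
# positive example scores higher; ties count as half a pair
diff = pos[:, None] - neg[None, :]
U = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
auc_from_u = U / (pos.size * neg.size)

# Reference value: area under the TPR vs. FPR curve
y_true = np.concatenate((np.ones(pos.size), np.zeros(neg.size)))
y_score = np.concatenate((pos, neg))
auc_sklearn = metrics.roc_auc_score(y_true, y_score)

print("AUC from U: %.4f, AUC from ROC curve: %.4f" % (auc_from_u, auc_sklearn))
```

The pairwise comparison costs memory proportional to the product of the two class sizes, so it is only meant as a check on small samples.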

## Precision-recall (PR) curve

<div class="alert alert-block alert-danger">
(TODO: show positive and negative distributions with two different overlaps (two classifiers), keep the rel. normalizations the same, and with sliding threshold show the point on the PR curve, for now have scikitlearn fig as placeholder)
</div>

![title](https://scikit-learn.org/stable/_images/sphx_glr_plot_precision_recall_001.png)

As precision is also a function of the threshold, the curve of precision vs. recall can be used to compare different classifiers, similar to the ROC curve.

As we decrease the threshold (loosen our acceptance), we include more true positives, so the true positive rate (the recall) increases. How the number of true positives grows relative to the number of false positives depends on the overlap of the score distributions of the positive and negative classes, i.e. the precision may increase or decrease depending on that overlap. Eventually, all the positive examples are exhausted and further loosening the acceptance (decreasing the threshold) only increases the number of false positives, i.e. the precision eventually decreases. Since the precision is sensitive to the overlap between the score distributions of the two classes, it can also be used to compare different classifiers.

For a given threshold, a higher precision means a smaller contamination of the selection by the negative class. The PR curve can therefore be summarized by the precision averaged over thresholds. Since recall is a monotonic function of the threshold, this average can be taken over recall, which makes the average precision (AP) the area under the PR curve. The higher the AP, the better the classifier.
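As a sketch of how the area under the PR curve can be computed, the block below evaluates it in two ways: as a step-wise sum of precision weighted by the change in recall, and with the trapezoidal rule via `metrics.auc`. It also compares against `metrics.average_precision_score`, which uses the step-wise sum. The toy Gaussian scores are an assumption made for the example.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy scores: balanced signal and background
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 1000)
neg = rng.normal(-1.0, 1.0, 1000)
y_true = np.concatenate((np.ones(pos.size), np.zeros(neg.size)))
y_score = np.concatenate((pos, neg))

precision, recall, _ = metrics.precision_recall_curve(y_true, y_score)

# Step-wise sum of precision times the change in recall between thresholds;
# recall is returned in decreasing order, hence the minus sign
ap_step = -np.sum(np.diff(recall) * precision[:-1])

# Trapezoidal area under the same curve
ap_trapz = metrics.auc(recall, precision)

ap_sklearn = metrics.average_precision_score(y_true, y_score)

print("step sum: %.4f, trapezoid: %.4f, average_precision_score: %.4f"
      % (ap_step, ap_trapz, ap_sklearn))
```

The two estimates are close when the curve has many points. The trapezoidal rule interpolates linearly between points, which can be slightly optimistic for PR curves.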

## When is PR curve better than ROC curve?

The PR curve looks at the precision, which is a function of the actual yields (or counts) of the signal (positive class) and background (negative class). The PR curve is therefore better suited to understanding the overlap between the actual distributions of the classifier scores for signal and background, while the ROC curve can only be used to understand the overlap between the probability distributions of the scores of the positive and negative classes. The PR curve (and the AP metric) is therefore useful for comparing classifiers when the classes in the dataset are imbalanced, such as in object detection, where most of the area in an image is background, and in information retrieval, where most of the text is likewise background. The ROC curve (and the AUC metric) is useful when the signal and background classes are balanced.
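A non-interactive numerical check of this claim is sketched below: the same pair of score distributions is evaluated at two background-to-signal ratios. The helper name `auc_and_ap`, the sample sizes, and the ratios 1 and 10 are assumptions made for the example.

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
n_sig = 2000  # hypothetical signal yield, kept fixed


def auc_and_ap(bkg_to_sig_ratio):
    """Return (ROC AUC, AP) for Gaussian toy scores at a given class ratio."""
    pos = rng.normal(1.0, 1.0, n_sig)
    neg = rng.normal(-1.0, 1.0, int(bkg_to_sig_ratio * n_sig))
    y_true = np.concatenate((np.ones(pos.size), np.zeros(neg.size)))
    y_score = np.concatenate((pos, neg))
    return (metrics.roc_auc_score(y_true, y_score),
            metrics.average_precision_score(y_true, y_score))


results = {ratio: auc_and_ap(ratio) for ratio in (1, 10)}
for ratio, (auc, ap) in results.items():
    print("bkg/signal = %2d: AUC = %.3f, AP = %.3f" % (ratio, auc, ap))
```

The AUC should stay roughly constant because it depends only on the shapes of the two score distributions, while the AP drops as the background yield grows.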

You can check it yourself by changing the relative normalization (or yield) of the background to signal class using the slider at the bottom of the plot below:

In [34]:
fig = plt.figure(figsize=(7, 5))

#actual distribution
ax_actdist = fig.add_subplot(2,2,1)


#probability distribution
ax_pdf = fig.add_subplot(2,2,2)

#precision-recall curve
ax_pr = fig.add_subplot(2,2,3)

#roc curve
ax_roc = fig.add_subplot(2,2,4)

N = 1000
Xpos = generate_scores(N, 1, 1)

def update(bkg_to_sig_ratio):
    # The background yield scales with the slider; the signal yield stays fixed at N
    Xneg = generate_scores(bkg_to_sig_ratio*N, -1, 1)
    y = np.array([-1]*len(Xneg) + [1]*len(Xpos))
    X = np.concatenate((Xneg, Xpos))

    precision, recall, _ = metrics.precision_recall_curve(y, X)
    ap = metrics.auc(recall, precision)
    fpr, tpr, _ = metrics.roc_curve(y, X)
    # metrics.auc(x, y): FPR on the x axis, TPR on the y axis
    auc = metrics.auc(fpr, tpr)

    # Redraw all four panels from scratch on every slider move
    for ax in (ax_actdist, ax_pdf, ax_pr, ax_roc):
        ax.clear()

    ax_actdist.hist(Xpos, bins=50, alpha=0.5, label="signal")
    ax_actdist.hist(Xneg, bins=50, alpha=0.5, label="bkg")
    ax_actdist.set_xlabel('classifier score')
    ax_actdist.set_ylabel('count')
    ax_actdist.set_title("actual yield")
    ax_actdist.legend(title="class")

    # density=True replaces the `normed` argument, which newer Matplotlib releases no longer accept
    ax_pdf.hist(Xpos, bins=50, alpha=0.5, density=True, label="signal")
    ax_pdf.hist(Xneg, bins=50, alpha=0.5, density=True, label="bkg")
    ax_pdf.set_xlabel('classifier score')
    ax_pdf.set_ylabel('probability density')
    ax_pdf.set_title("pdfs")

    ax_pr.plot(recall, precision)
    ax_pr.set_title("precision-recall (AP: %.3f)" % ap)
    ax_pr.set_xlabel("recall")
    ax_pr.set_ylabel("precision")

    # Standard ROC convention: TPR (y axis) vs. FPR (x axis)
    ax_roc.plot(fpr, tpr)
    ax_roc.set_title("ROC (AUC: %.3f)" % auc)
    ax_roc.set_xlabel("False Positive Rate")
    ax_roc.set_ylabel("True Positive Rate (recall)")

    fig.tight_layout(pad=3.0)
    fig.canvas.draw_idle()
    


interact(update, bkg_to_sig_ratio=FloatSlider(description='bkg/signal', min=1.0, max=10.1, step=0.5, value=1.0) );


## References
<div class="alert alert-block alert-danger"> 
TODO
</div>