This notebook explores the AP scores for a 90% accurate prediction.

It shows that the AP score depends strongly on the confidence scores attributed to false positives and true positives.

Further, it shows that this effect is greater for rarely occurring classes than for commonly occurring classes.

If the confidence is successfully allocated such that false positives are less confident than true positives, then AP will correspond to the intuitive notion of the accuracy of the test (here 90%).

However, if the confidence is not successfully allocated, then particularly for rare labels, the AP score may be heavily penalised.


Acknowledgments

Thanks to @Tito and @ZFTurbo for ZFTurbo's mAP code and Tito's explanation of it in:

https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/217158

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os


!pip install map-boxes

from map_boxes import mean_average_precision_for_boxes



Start with something simple:

There are 10 images and one label

Each ground truth (ann) image has exactly one instance of the label

9 of the images are detected (det)

Bounding boxes are always the same values (like having one cell per image in the HPA competition)

In [None]:
import pandas as pd

ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1'],
            'XMin':[1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10]
        }

det_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8'],
            'LabelName': ['1','1','1','1','1','1','1','1','1'],
            'Conf':[1,1,1,1,1,1,1,1,1],
            'XMin':[1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])
det_df = pd.DataFrame (det_data, columns = ['ImageID', 'LabelName','Conf', 'XMin', 'XMax', 'YMin', 'YMax'])

print (ann_df)
print (det_df)


mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df)


So we found nine out of 10 and got a 90% score, which is pretty intuitive so far.

Now, let's get 9 right and also detect a false positive (by using a different bounding box).

We give it a lower confidence (plausible, since it is a wrong detection).

In [None]:

ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1'],
            'XMin':[1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10]
        }

det_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1'],
            'Conf':[1,1,1,1,1,1,1,1,1,0.9],
            'XMin':[1,1,1,1,1,1,1,1,1,11],
            'XMax':[10,10,10,10,10,10,10,10,10,20],
            'YMin':[1,1,1,1,1,1,1,1,1,11],
            'YMax':[10,10,10,10,10,10,10,10,10,20]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])
det_df = pd.DataFrame (det_data, columns = ['ImageID', 'LabelName','Conf', 'XMin', 'XMax', 'YMin', 'YMax'])

print (ann_df)
print (det_df)


mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df)

That still got 90%, which is still intuitive.

Now let's make some of the true positives less confident than the false positive by changing 'Conf'.

In [None]:

ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1'],
            'XMin':[1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10]
        }

det_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1'],
            'Conf':[0.8,0.8,0.8,0.8,1,1,1,1,1,0.9],
            'XMin':[1,1,1,1,1,1,1,1,1,11],
            'XMax':[10,10,10,10,10,10,10,10,10,20],
            'YMin':[1,1,1,1,1,1,1,1,1,11],
            'YMax':[10,10,10,10,10,10,10,10,10,20]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])
det_df = pd.DataFrame (det_data, columns = ['ImageID', 'LabelName','Conf', 'XMin', 'XMax', 'YMin', 'YMax'])

print (ann_df)
print (det_df)


mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df)

Aha: we lost some score on that, now only 0.86.

And if we make all our true positives less confident than our false positive:

In [None]:
ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1'],
            'XMin':[1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10]
        }

det_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1'],
            'Conf':[0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.9],
            'XMin':[1,1,1,1,1,1,1,1,1,11],
            'XMax':[10,10,10,10,10,10,10,10,10,20],
            'YMin':[1,1,1,1,1,1,1,1,1,11],
            'YMax':[10,10,10,10,10,10,10,10,10,20]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])
det_df = pd.DataFrame (det_data, columns = ['ImageID', 'LabelName','Conf', 'XMin', 'XMax', 'YMin', 'YMax'])

print (ann_df)
print (det_df)


mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df)

Now only 0.81, although we still got 90% right!
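To see where 0.9, 0.86 and 0.81 come from, here is a minimal sketch of an interpolated AP calculation in plain Python. It assumes that detections are ranked by Conf and that the Pascal VOC style precision envelope is used, an assumption that is consistent with the scores reported above. The function interpolated_ap is a hypothetical helper written for this illustration, not part of map_boxes.

```python
def interpolated_ap(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) pairs."""
    # Rank detections from most to least confident (the sort is stable,
    # so ties keep their input order).
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    recalls, precisions = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Precision envelope: at each rank, use the best precision reached
    # at this rank or at any later (less confident) rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the envelope, one per recall step.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# 9 true positives plus one false positive, 10 ground truth boxes
fp_last = [(1.0, True)] * 9 + [(0.9, False)]
fp_middle = [(0.8, True)] * 4 + [(1.0, True)] * 5 + [(0.9, False)]
fp_first = [(0.8, True)] * 9 + [(0.9, False)]

print(round(interpolated_ap(fp_last, 10), 2))    # 0.9
print(round(interpolated_ap(fp_middle, 10), 2))  # 0.86
print(round(interpolated_ap(fp_first, 10), 2))   # 0.81
```

Under this assumption, a false positive only costs score for the true positives ranked below it: every recall step after the false positive is weighted by the final precision of 0.9 instead of 1.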

OK, so let's model a working detection solution with the following characteristics:

Given a ground truth positive, detects positive with 90% probability

Given a ground truth negative (using label 18 as in HPA competition for 'Negative' label), detects positive with 10% probability

In [None]:
from random import random  # only random() is used below

ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1'],
            'XMin':[1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])

image_id_list = []
label_name_list = []
conf_list = []
xmin_list = []
xmax_list = []
ymin_list = []
ymax_list = []

image_id=0

for i, ann in ann_df.iterrows():
    image_id_list.append(str(i))
    ground_truth_label=ann['LabelName']
    rnd=random()
    if (ground_truth_label == '1'):
        if rnd  < 0.9:
            label_name_list.append('1')
        else:
            label_name_list.append('18')
    else:
        if rnd  < 0.9:
            label_name_list.append('18')
        else:
            label_name_list.append('1')
    conf_list.append(random())
    xmin_list.append(1)
    xmax_list.append(10)
    ymin_list.append(1)
    ymax_list.append(10) 


det_data = {'ImageID':   image_id_list,
            'LabelName': label_name_list,
            'Conf':conf_list,
            'XMin':xmin_list,
            'XMax':xmax_list,
            'YMin':ymin_list,
            'YMax':ymax_list
        }


det_df = pd.DataFrame (det_data, columns = ['ImageID', 'LabelName','Conf', 'XMin', 'XMax', 'YMin', 'YMax'])

print (ann_df)
print (det_df)


mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df)

Running this a few times, you will see scores of 1, 0.9, 0.8, 0.7 ... depending on how the random numbers turn out.

(Note that there are no ground truth negatives here and no differing bounding boxes, so it is like the HPA case of one cell per image with every cell labelled '1'. In this case the AP corresponds directly to the intuitive percentage of correct identifications.)


Let's add some ground truth negatives:

In [None]:
ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1','18','18','18','18','18','18','18','18','18','18'],
            'XMin':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])

image_id_list = []
label_name_list = []
conf_list = []
xmin_list = []
xmax_list = []
ymin_list = []
ymax_list = []

image_id=0

for i, ann in ann_df.iterrows():
    image_id_list.append(str(i))
    ground_truth_label=ann['LabelName']
    rnd=random()
    if (ground_truth_label == '1'):
        if rnd  < 0.9:
            label_name_list.append('1')
        else:
            label_name_list.append('18')
    else:
        if rnd  < 0.9:
            label_name_list.append('18')
        else:
            label_name_list.append('1')
    conf_list.append(random())
    xmin_list.append(1)
    xmax_list.append(10)
    ymin_list.append(1)
    ymax_list.append(10) 


det_data = {'ImageID':   image_id_list,
            'LabelName': label_name_list,
            'Conf':conf_list,
            'XMin':xmin_list,
            'XMax':xmax_list,
            'YMin':ymin_list,
            'YMax':ymax_list
        }


det_df = pd.DataFrame (det_data, columns = ['ImageID', 'LabelName','Conf', 'XMin', 'XMax', 'YMin', 'YMax'])

print (ann_df)
print (det_df)


mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df)

If you exercise this a few times, you will see that, depending on the random numbers, the score is often high, but can also get quite low

For instance, my first execution of this version gave:

1                              | 0.466667 |      10

18                             | 0.626190 |      10

mAP: 0.546429

So although our detection is designed to behave with 90% accuracy, the AP for label 1 was less than 0.5 and the overall mAP was about 0.55!

Ok, that was just bad luck, so we need to bring probabilities into it.

This is reminiscent of the much-cited case of a test for a rare disease, where the probability of actually having the disease given a positive result from a fairly reliable test is lower than intuition suggests. The reason is Bayes' theorem: because the disease is rare, a positive result is relatively likely to be a false positive rather than a true positive.
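As a small worked example of that effect, the sketch below applies Bayes' theorem to a test with the same characteristics as our model detector (90% chance of a positive result for a true case, 10% chance of a positive result for a non-case). The function name posterior_positive and the two prevalence values are chosen here purely for illustration.

```python
def posterior_positive(prevalence, sensitivity=0.9, false_positive_rate=0.1):
    """P(condition is present | test is positive), by Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# a common condition (1 in 2) versus a rare one (1 in 20)
for prevalence in (0.5, 0.05):
    print(prevalence, round(posterior_positive(prevalence), 3))
# 0.5 0.9
# 0.05 0.321
```

So with a prevalence of one in 20, fewer than one in three positive results is a true positive, even though the test is "90% accurate".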

So now, let's model that situation: make label 1 rare, only one in 20:


In [None]:
ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19'],
            'LabelName': ['1','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18'],
            'XMin':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])

image_id_list = []
label_name_list = []
conf_list = []
xmin_list = []
xmax_list = []
ymin_list = []
ymax_list = []

image_id=0

for i, ann in ann_df.iterrows():
    image_id_list.append(str(i))
    ground_truth_label=ann['LabelName']
    rnd=random()
    if (ground_truth_label == '1'):
        if rnd  < 0.9:
            label_name_list.append('1')
        else:
            label_name_list.append('18')
    else:
        if rnd  < 0.9:
            label_name_list.append('18')
        else:
            label_name_list.append('1')
    conf_list.append(random())
    xmin_list.append(1)
    xmax_list.append(10)
    ymin_list.append(1)
    ymax_list.append(10) 


det_data = {'ImageID':   image_id_list,
            'LabelName': label_name_list,
            'Conf':conf_list,
            'XMin':xmin_list,
            'XMax':xmax_list,
            'YMin':ymin_list,
            'YMax':ymax_list
        }


det_df = pd.DataFrame (det_data, columns = ['ImageID', 'LabelName','Conf', 'XMin', 'XMax', 'YMin', 'YMax'])

print (ann_df)
print (det_df)


mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df)

Well, I had to run it quite a few times to get a run with a false negative (the single label 1 cell missed),

and then the score was:

Annotations length: 20

1                              | 0.000000 |       1

18                             | 0.814241 |      19

mAP: 0.407121

And when I had a true positive, the otherwise high AP for label 1 was often dragged down by several false positives that had a greater Conf than the true positive.


Now let:

p = number of cells having ground truth label 1

n  = number of cells in the test set

then:

expected number of label 1 true positives: 0.9 * p

expected number of label 1 false positives: 0.1 * (n-p)

Looking at a specific case where these expected results occur: if all the false positives have a lower confidence than all the true positives, then the false positives are effectively ignored (they sit at the end of the curve), because the precision stays at 1 over the whole range of recall from 0 to 0.9, giving an area under the curve of 1 * 0.9 = 0.9.

However, if the false positives all have higher confidence than the true positives, then when the first true positive is found, the precision is:

number of correct label 1 predictions / number of label 1 predictions so far

= 1 / (number of false positives + 1)

and as each further true positive is found (increasing recall), this grows to:

= (number of true positives) / (number of false positives + number of true positives)

= 0.9 * p / ( (0.1 * (n-p)) + (0.9 * p) )

= 0.9 * p / ( (0.1 * n) + (0.8 * p) )

= 0.9 / ( (0.1 * (n/p)) + 0.8 )                  [assuming p != 0; if p = 0 there are no true positives and the precision is zero]

So as p approaches n, this precision approaches 0.9 / (0.1 + 0.8) = 1,

but as p approaches one, n/p approaches n and the precision becomes small if n is large.
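To put numbers on this, the sketch below evaluates the final precision in the worst case (all false positives ranked above all true positives) for n = 20 and a few values of p, and checks it against the closed form 0.9 / ((0.1 * (n/p)) + 0.8). The function name and the chosen values of p are illustrative assumptions only.

```python
def worst_case_final_precision(n, p, hit_rate=0.9, false_alarm_rate=0.1):
    """Precision once every expected true positive has been found,
    with all expected false positives ranked ahead of them."""
    expected_tp = hit_rate * p
    expected_fp = false_alarm_rate * (n - p)
    return expected_tp / (expected_tp + expected_fp)


n = 20
for p in (1, 5, 15, 20):
    direct = worst_case_final_precision(n, p)
    closed_form = 0.9 / ((0.1 * (n / p)) + 0.8)
    assert abs(direct - closed_form) < 1e-9
    print(p, round(direct, 3))
# 1 0.321
# 5 0.75
# 15 0.964
# 20 1.0
```

With p = 1 the worst-case precision is only about a third, while with p = 15 the false positives barely matter, which is the rare-label effect described above.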

So to conclude: 
The relative confidence allocated to true positives and false positives is important.
If the confidence is successfully allocated such that false positives are less confident than true positives, then AP will correspond to the intuitive notion of the accuracy of the test (here 90%).

However, if the confidence is not successfully allocated, then particularly for rare labels, the AP score may be heavily penalised.


The next two code cells run an experiment to check this in practice, since the above argument doesn't treat the different distributions of conf scores.

In both cells, the function make_det makes 90% correct predictions as above and is called 1000 times to get a representative result over the randomly created predictions and Conf scores.

In both cells n=20. In the first experiment p=1 (label 1 is rare), and in the second p=15 (label 1 is common).

Result:

With p=1 the mean AP for label 1 is around 0.60, and with p=15 it is around 0.88. So with the same 90% test accuracy, the AP score of the rare label is lower.


In [None]:
ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19'],
            'LabelName': ['1','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18','18'],
            'XMin':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])

def make_det(adf):
    """Produce predictions for the rows of adf, each correct with 90% probability."""
    image_id_list = []
    label_name_list = []
    conf_list = []
    xmin_list = []
    xmax_list = []
    ymin_list = []
    ymax_list = []

    image_id=0

    for i, ann in adf.iterrows():
        image_id_list.append(str(i))
        ground_truth_label=ann['LabelName']
        rnd=random()
        if (ground_truth_label == '1'):
            if rnd  < 0.9:
                label_name_list.append('1')
            else:
                label_name_list.append('18')
        else:
            if rnd  < 0.9:
                label_name_list.append('18')
            else:
                label_name_list.append('1')
        conf_list.append(random())
        xmin_list.append(1)
        xmax_list.append(10)
        ymin_list.append(1)
        ymax_list.append(10) 


    det_data = {'ImageID':   image_id_list,
                'LabelName': label_name_list,
                'Conf':conf_list,
                'XMin':xmin_list,
                'XMax':xmax_list,
                'YMin':ymin_list,
                'YMax':ymax_list
            }


    ddf = pd.DataFrame (det_data, columns = ['ImageID', 'LabelName','Conf', 'XMin', 'XMax', 'YMin', 'YMax'])
    return ddf

sum_ap = 0
n_runs = 1000
for i in range(n_runs):
    det_df = make_det(ann_df)
    mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df, verbose=False)
    sum_ap = sum_ap + average_precisions['1'][0]

# report the mean AP for label 1 over all runs
print(sum_ap / n_runs)

In [None]:
ann_data = {'ImageID':    ['0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19'],
            'LabelName': ['1','1','1','1','1','1','1','1','1','1','1','1','1','1','1','18','18','18','18','18'],
            'XMin':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
            'XMax':[10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10],
            'YMin':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
            'YMax':[10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10]
        }

ann_df = pd.DataFrame (ann_data, columns = ['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])

sum_ap = 0
n_runs = 1000
for i in range(n_runs):
    det_df = make_det(ann_df)
    mean_ap, average_precisions = mean_average_precision_for_boxes(ann_df, det_df, verbose=False)
    sum_ap = sum_ap + average_precisions['1'][0]

# report the mean AP for label 1 over all runs
print(sum_ap / n_runs)
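As a cross-check that does not depend on map_boxes, here is a rough standard-library sketch of the same experiment. It assumes one box per image, Conf drawn uniformly at random, and a Pascal VOC style interpolated AP for label 1. All function names are hypothetical, and the seed is fixed only to make runs repeatable, so treat the numbers it prints as indicative rather than as the library's exact output.

```python
import random


def interpolated_ap(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) pairs."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    points = []  # (recall, precision) after each ranked detection
    tp = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += is_tp
        points.append((tp / num_ground_truth, tp / rank))
    # Walk backwards so that 'best' is the precision envelope.
    ap, best = 0.0, 0.0
    for i in range(len(points) - 1, -1, -1):
        recall, precision = points[i]
        best = max(best, precision)
        prev_recall = points[i - 1][0] if i > 0 else 0.0
        ap += (recall - prev_recall) * best
    return ap


def mean_ap_label_1(n, p, trials=1000, seed=0):
    """Mean AP for label 1 when p of the n cells truly have label 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        detections = []
        for cell in range(n):
            is_label_1 = cell < p
            correct = rng.random() < 0.9  # 90% accurate prediction
            predicted_label_1 = is_label_1 if correct else not is_label_1
            conf = rng.random()           # confidence carries no information
            if predicted_label_1:
                detections.append((conf, is_label_1))
        total += interpolated_ap(detections, num_ground_truth=p)
    return total / trials


rare = mean_ap_label_1(n=20, p=1)
common = mean_ap_label_1(n=20, p=15)
print(rare, common)
```

If the assumptions hold, the rare-label mean should come out clearly below the common-label mean, in line with the result reported above.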