Reading the source code is a good way to understand how mAP is calculated.
This notebook walks through that code and then shares a few practical tips that follow from it.

1. [mAP understanding with code](#mAP-understanding-with-code)
2. [The lower the confidence threshold is better](#The-lower-the-confidence-threshold-is-better)
3. [Class dependence is linear](#Class-dependence-is-linear)
4. [Small sample class is important](#Small-sample-class-is-important)

### Credits

I borrowed @ZFTurbo's great code for mAP calculation here.

https://github.com/ZFTurbo/Mean-Average-Precision-for-Boxes

pycocotools is often used for mAP calculation, but I recommend this package instead.
It is very easy to use and provides just what is needed, no more and no less.

<pre>
!pip install map-boxes

from map_boxes import mean_average_precision_for_boxes

ann = ann[['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax']].values
det = det[['ImageID', 'LabelName', 'Conf', 'XMin', 'XMax', 'YMin', 'YMax']].values
mean_ap, average_precisions = mean_average_precision_for_boxes(ann, det)
</pre>

In addition, as we can see below, this source code is very simple and easy to read.

# mAP-understanding-with-code
In this section I explain the source code for computing mAP.
If mAP is completely new to you, try reading the following articles first.

[mAP (mean Average Precision) for Object Detection](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173).

[Breaking Down Mean Average Precision (mAP)](https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52).



In [None]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from IPython.display import Image

### IOU Calculation

Compute the IOU between each of the N annotation boxes and a single query box (`query_boxes`). All boxes are given as [x_min, y_min, x_max, y_max].


In [None]:
Image("../input/vinbig-validation-data/iou.png")

In [None]:
def compute_overlap(boxes, query_boxes):
    """
    Args
        boxes:       (N, 4) ndarray of float
        query_boxes: (4)    ndarray of float
    Returns
        overlaps: (N) ndarray of overlap between boxes and query_boxes
    """
    N = boxes.shape[0]
    overlaps = np.zeros((N), dtype=np.float64)
    box_area = (
        (query_boxes[2] - query_boxes[0]) *
        (query_boxes[3] - query_boxes[1])
    )
    for n in range(N):
        iw = (
            min(boxes[n, 2], query_boxes[2]) -
            max(boxes[n, 0], query_boxes[0])
        )
        if iw > 0:
            ih = (
                min(boxes[n, 3], query_boxes[3]) -
                max(boxes[n, 1], query_boxes[1])
            )
            if ih > 0:
                ua = np.float64(
                    (boxes[n, 2] - boxes[n, 0]) *
                    (boxes[n, 3] - boxes[n, 1]) +
                    box_area - iw * ih
                )
                overlaps[n] = iw * ih / ua
    return overlaps
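
To check the IOU logic by hand, here is a minimal vectorized sketch of the same calculation. The function name `iou_one_to_many` and the toy boxes are my own assumptions for illustration; this is not part of the map-boxes package.

```python
import numpy as np

def iou_one_to_many(boxes, query_box):
    """IoU between each row of `boxes` (N, 4) and one `query_box` (4,).

    Hypothetical helper for illustration; boxes are [x_min, y_min, x_max, y_max].
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    query_box = np.asarray(query_box, dtype=np.float64)

    # Width and height of the intersection, clipped to zero when there is no overlap.
    iw = np.clip(np.minimum(boxes[:, 2], query_box[2]) - np.maximum(boxes[:, 0], query_box[0]), 0, None)
    ih = np.clip(np.minimum(boxes[:, 3], query_box[3]) - np.maximum(boxes[:, 1], query_box[1]), 0, None)
    inter = iw * ih

    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_query = (query_box[2] - query_box[0]) * (query_box[3] - query_box[1])
    union = area_boxes + area_query - inter

    # Divide only where the union is positive; everything else stays 0.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

toy_boxes = [[0, 0, 2, 2], [1, 1, 3, 3], [5, 5, 6, 6]]
toy_ious = iou_one_to_many(toy_boxes, [0, 0, 2, 2])
print(toy_ious)  # identical box -> 1.0, partial overlap -> 1/7, disjoint box -> 0.0
```

The second box shares a 1x1 square with the query box, so the IOU is 1 / (4 + 4 - 1) = 1/7.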

### check if true positive or false positive

Each detected bounding box is assigned to the GT box with which it has the largest IOU.
It counts as a TP if that IOU is at least iou_threshold and the GT box has not been matched yet; otherwise it counts as an FP.
If multiple detected bounding boxes map to the same GT box, only the first one processed gets the match, so the highest-scored one has to come first.
So note that **the detections passed to this function need to be pre-sorted by score in descending order**.

In [None]:
def cehck_if_true_or_false_positive(annotations, detections, iou_threshold):
    annotations = np.array(annotations, dtype=np.float64)
    scores = []
    false_positives = []
    true_positives = []
    detected_annotations = []  # each GT box may be matched to at most one predicted box
    for d in detections:
        scores.append(d[4])
        if len(annotations) == 0:
            false_positives.append(1)
            true_positives.append(0)
            continue
        overlaps = compute_overlap(annotations, d[:4])
        assigned_annotation = np.argmax(overlaps)
        max_overlap = overlaps[assigned_annotation]
        if max_overlap >= iou_threshold and assigned_annotation not in detected_annotations:
            false_positives.append(0)
            true_positives.append(1)
            detected_annotations.append(assigned_annotation)
        else:
            false_positives.append(1)
            true_positives.append(0)
    return scores, false_positives, true_positives
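
To see why the pre-sorting matters, here is a small self-contained sketch of the same greedy matching idea. The helper names (`simple_iou`, `greedy_match`) and the toy boxes are assumptions made up for this example, not the notebook's functions.

```python
def simple_iou(a, b):
    """IoU of two boxes given as [x_min, y_min, x_max, y_max]."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def greedy_match(gt_boxes, detections, iou_threshold=0.4):
    """Return (score, is_tp) per detection, processed in the given order.

    Each detection is [x_min, y_min, x_max, y_max, score].
    """
    matched = set()  # indices of GT boxes that already have a detection
    result = []
    for det in detections:
        overlaps = [simple_iou(gt, det[:4]) for gt in gt_boxes]
        best = max(range(len(overlaps)), key=overlaps.__getitem__) if overlaps else None
        if best is not None and overlaps[best] >= iou_threshold and best not in matched:
            matched.add(best)
            result.append((det[4], 1))
        else:
            result.append((det[4], 0))
    return result

gt = [[0, 0, 10, 10]]
strong = [0, 0, 10, 10, 0.9]  # perfect box, high score
weak = [1, 1, 10, 10, 0.3]    # IoU 0.81 with the same GT box, low score

sorted_result = greedy_match(gt, [strong, weak])
unsorted_result = greedy_match(gt, [weak, strong])
print(sorted_result)    # [(0.9, 1), (0.3, 0)]
print(unsorted_result)  # [(0.3, 1), (0.9, 0)]
```

With the unsorted input, the low-scored box takes the GT box and the high-scored box becomes a false positive, which puts an FP at the top of the ranking and lowers AP.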

### Average Precision Calculation

This part is a bit complicated, but all it does is calculate the area under the precision/recall curve.

In [None]:
def _compute_ap(recall, precision):
    """ Compute the average precision, given the recall and precision curves.
    Code originally from https://github.com/rbgirshick/py-faster-rcnn.
    # Arguments
        recall:    The recall curve (list).
        precision: The precision curve (list).
    # Returns
        The average precision as computed in py-faster-rcnn.
    """
    # correct AP calculation
    # first add sentinel values at both ends of each curve
    mrec = np.concatenate(([0.], recall, [1.]))
    mpre = np.concatenate(([0.], precision, [0.]))

    # compute the precision envelope
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

    # to calculate area under PR curve, look for points
    # where X axis (recall) changes value
    i = np.where(mrec[1:] != mrec[:-1])[0]

    # and sum (\Delta recall) * prec
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap
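
A tiny worked example may make the envelope step easier to follow. This is a minimal sketch under my own assumptions: the function name `ap_from_sorted_flags` is hypothetical, and it uses `np.maximum.accumulate` in place of the explicit loop above, which should give the same envelope.

```python
import numpy as np

def ap_from_sorted_flags(tp_flags, num_gt):
    """AP from TP/FP flags (1 = TP, 0 = FP) already sorted by descending score."""
    tp_flags = np.asarray(tp_flags, dtype=np.float64)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / num_gt
    precision = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)

    # Sentinel values at both ends, as in _compute_ap.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))

    # Precision envelope: running maximum taken from right to left.
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]

    # Sum (delta recall) * precision at the points where recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# Three detections sorted by score: TP, FP, TP, with 2 GT boxes in total.
# precision = [1, 1/2, 2/3], recall = [1/2, 1/2, 1]
toy_ap = ap_from_sorted_flags([1, 0, 1], num_gt=2)
print(round(toy_ap, 4))  # 0.5 * 1 + 0.5 * (2/3) = 0.8333
```

The envelope replaces the dip to 1/2 with 2/3, so the second half of the recall axis contributes 0.5 * 2/3.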

### Utility Functions

These functions will group bounding boxes by image ID and label.

In [None]:
def get_real_annotations(table):
    res = dict()
    # np.str was removed in NumPy 1.24; the builtin str behaves the same here
    ids = table['ImageID'].values.astype(str)
    labels = table['LabelName'].values.astype(str)
    xmin = table['XMin'].values.astype(np.float32)
    xmax = table['XMax'].values.astype(np.float32)
    ymin = table['YMin'].values.astype(np.float32)
    ymax = table['YMax'].values.astype(np.float32)

    for i in range(len(ids)):
        id = ids[i]
        label = labels[i]
        if id not in res:
            res[id] = dict()
        if label not in res[id]:
            res[id][label] = []
        box = [xmin[i], ymin[i], xmax[i], ymax[i]]
        res[id][label].append(box)

    return res

def get_detections(table):
    res = dict()
    # np.str was removed in NumPy 1.24; the builtin str behaves the same here
    ids = table['ImageID'].values.astype(str)
    labels = table['LabelName'].values.astype(str)
    scores = table['Conf'].values.astype(np.float32)
    xmin = table['XMin'].values.astype(np.float32)
    xmax = table['XMax'].values.astype(np.float32)
    ymin = table['YMin'].values.astype(np.float32)
    ymax = table['YMax'].values.astype(np.float32)

    for i in range(len(ids)):
        id = ids[i]
        label = labels[i]
        if id not in res:
            res[id] = dict()
        if label not in res[id]:
            res[id][label] = []
        box = [xmin[i], ymin[i], xmax[i], ymax[i], scores[i]]
        res[id][label].append(box)
    return res
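
To show the nested structure these functions produce, here is a minimal sketch with plain Python rows instead of a DataFrame. The helper name `group_boxes` and the toy rows are assumptions for illustration only.

```python
from collections import defaultdict

def group_boxes(rows):
    """Group boxes as {image_id: {label: [[x_min, y_min, x_max, y_max], ...]}}.

    Each row is (image_id, label, x_min, x_max, y_min, y_max), i.e. the same
    column order as the annotation table; the box itself is reordered.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for image_id, label, x_min, x_max, y_min, y_max in rows:
        grouped[str(image_id)][str(label)].append([x_min, y_min, x_max, y_max])
    # Convert to plain dicts so that missing keys raise instead of being created.
    return {image_id: dict(by_label) for image_id, by_label in grouped.items()}

toy_rows = [
    ("img1", 0, 10, 50, 20, 60),
    ("img1", 0, 15, 40, 25, 45),
    ("img2", 3, 0, 5, 0, 5),
]
toy_grouped = group_boxes(toy_rows)
print(toy_grouped["img1"]["0"])  # [[10, 20, 50, 60], [15, 25, 40, 45]]
```

Note that the table columns are ordered XMin, XMax, YMin, YMax, while each stored box is ordered [x_min, y_min, x_max, y_max], which is what compute_overlap expects.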


### mean Average Precision Calculation
Now we can calculate mAP.

In [None]:
def mean_average_precision_for_boxes(ann, pred, iou_threshold=0.4, exclude_not_in_annotations=False, verbose=True):
    """
    :param ann: numpy array of shape (N, 6) with columns ImageID, LabelName, XMin, XMax, YMin, YMax
    :param pred: numpy array of shape (N, 7) with columns ImageID, LabelName, Conf, XMin, XMax, YMin, YMax
    :param iou_threshold: IoU between boxes which count as 'match'. Default: 0.4
    :param exclude_not_in_annotations: exclude image IDs which do not exist in annotations. Default: False
    :param verbose: print detailed run info. Default: True
    :return: tuple, where the first value is mAP and the second is a dict mapping each class to (AP, num_annotations, precision, recall).
    """

    valid = pd.DataFrame(ann, columns=['ImageID', 'LabelName', 'XMin', 'XMax', 'YMin', 'YMax'])
    preds = pd.DataFrame(pred, columns=['ImageID', 'LabelName', 'Conf', 'XMin', 'XMax', 'YMin', 'YMax'])
    ann_unique = valid['ImageID'].unique()
    preds_unique = preds['ImageID'].unique()

    if verbose:
        print('Number of files in annotations: {}'.format(len(ann_unique)))
        print('Number of files in predictions: {}'.format(len(preds_unique)))

    # Exclude files not in annotations!
    if exclude_not_in_annotations:
        preds = preds[preds['ImageID'].isin(ann_unique)]
        preds_unique = preds['ImageID'].unique()
        if verbose:
            print('Number of files in detection after reduction: {}'.format(len(preds_unique)))

    unique_classes = valid['LabelName'].unique().astype(str)  # np.str was removed in NumPy 1.24
    if verbose:
        print('Unique classes: {}'.format(len(unique_classes)))

    all_detections = get_detections(preds)
    all_annotations = get_real_annotations(valid)
    if verbose:
        print('Detections length: {}'.format(len(all_detections)))
        print('Annotations length: {}'.format(len(all_annotations)))

    average_precisions = {}
    for zz, label in enumerate(sorted(unique_classes)):

        # Negative class
        if str(label) == 'nan':
            continue

        false_positives = []
        true_positives = []
        scores = []
        num_annotations = 0.0

        for i in range(len(ann_unique)):
            detections = []
            annotations = []
            id = ann_unique[i]
            if id in all_detections:
                if label in all_detections[id]:
                    detections = all_detections[id][label]
            if id in all_annotations:
                if label in all_annotations[id]:
                    annotations = all_annotations[id][label]

            if len(detections) == 0 and len(annotations) == 0:
                continue
                
            num_annotations += len(annotations)
            
            scr, fp, tp = cehck_if_true_or_false_positive(annotations, detections, iou_threshold)
            scores += scr
            false_positives += fp
            true_positives += tp

        if num_annotations == 0:
            # keep the same 4-tuple shape that the mAP loop below unpacks
            average_precisions[label] = 0, 0, np.array([]), np.array([])
            continue

        false_positives = np.array(false_positives)
        true_positives = np.array(true_positives)
        scores = np.array(scores)

        # sort by score
        indices = np.argsort(-scores)
        false_positives = false_positives[indices]
        true_positives = true_positives[indices]

        # compute false positives and true positives
        false_positives = np.cumsum(false_positives)
        true_positives = np.cumsum(true_positives)

        # compute recall and precision
        recall = true_positives / num_annotations
        precision = true_positives / np.maximum(true_positives + false_positives, np.finfo(np.float64).eps)

        # compute average precision
        average_precision = _compute_ap(recall, precision)
        average_precisions[label] = average_precision, num_annotations, precision, recall
        if verbose:
            s1 = "{:30s} | {:.6f} | {:7d}".format(label, average_precision, int(num_annotations))
            print(s1)

    present_classes = 0
    precision = 0
    for label, (average_precision, num_annotations, _, _) in average_precisions.items():
        if num_annotations > 0:
            present_classes += 1
            precision += average_precision
    mean_ap = precision / present_classes
    if verbose:
        print('mAP: {:.6f}'.format(mean_ap))
    return mean_ap, average_precisions

One of the most frequently asked questions about mAP is where the confidence scores are used.
Confidence scores are used only for sorting, for the following two purposes.
1. Detections should be sorted by confidence score before cehck_if_true_or_false_positive().
2. The false_positives and true_positives lists should be sorted by confidence score before the Average Precision calculation.

That's it.
Sometimes reading the source code is the best way to understand.
I hope this helps to understand mAP!

# The-lower-the-confidence-threshold-is-better

Lowering the confidence threshold never lowers the mAP computed by the code above, and usually raises it.

In [None]:
# predicted boxes
df_pred = pd.read_csv('../input/vinbig-validation-data/valid_cv5.csv')
df_pred = df_pred.sort_values('conf', ascending=False)

# GT annotation
df = pd.read_csv('../input/vinbigdata-chest-xray-abnormalities-detection/train.csv')
df.loc[df['class_id'] == 14, 'x_min'] = 0
df.loc[df['class_id'] == 14, 'y_min'] = 0
df.loc[df['class_id'] == 14, 'x_max'] = 1
df.loc[df['class_id'] == 14, 'y_max'] = 1

# use only the first radiologist's annotations for each image here
tgt_rad = 0
filtered_df_list = []
for image_id, df_img in df.groupby('image_id'):
    rad_ids = df_img['rad_id'].unique()
    rad_id = rad_ids[tgt_rad]
    filtered_df_list.append(df_img[df_img['rad_id'] == rad_id])
df_anno = pd.concat(filtered_df_list).reset_index(drop=True)

class_name = df[['class_name','class_id']].set_index('class_id').to_dict()['class_name']

In [None]:
df_pred_thre001 = df_pred[df_pred.conf > 0.01]
df_pred_thre05 = df_pred[df_pred.conf > 0.5]
pred_thre001 = df_pred_thre001[['image_id', 'class_id', 'conf','x_min','x_max','y_min','y_max']].values
pred_thre05 = df_pred_thre05[['image_id', 'class_id', 'conf','x_min','x_max','y_min','y_max']].values
anno = df_anno[['image_id', 'class_id','x_min','x_max','y_min','y_max']].values

mean_ap_thre001, average_precisions_thre001 = mean_average_precision_for_boxes(anno, pred_thre001, verbose=False)
mean_ap_thre05, average_precisions_thre05 = mean_average_precision_for_boxes(anno, pred_thre05, verbose=False)
print('mAP with threshold 0.01', round(mean_ap_thre001,2))
print('mAP with threshold 0.5 ', round(mean_ap_thre05,2))

The reason becomes clear when we compare the following two (left and right) figures and remember that AP is the area under the precision/recall curve.
**Using a large threshold is the same thing as using only part of the area under the precision/recall curve.**
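
The same effect can be reproduced with a toy ranking. This is a minimal sketch under my own assumptions: `average_precision` is a hypothetical pure-Python reformulation of the envelope AP (each TP adds 1/num_gt times the best precision at or after its rank), and the scores and flags are made up.

```python
def average_precision(tp_flags, num_gt):
    """Envelope AP from TP/FP flags (1 = TP, 0 = FP) sorted by descending score."""
    precisions = []
    tp = 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / rank)
    ap = 0.0
    for pos, flag in enumerate(tp_flags):
        if flag:
            # Recall grows by 1/num_gt here; the envelope is the best precision from here on.
            ap += max(precisions[pos:]) / num_gt
    return ap

# Hypothetical detections of one class, sorted by score, with 3 GT boxes in total.
scores = [0.9, 0.8, 0.6, 0.3, 0.1]
flags = [1, 0, 1, 0, 1]

def ap_at_threshold(threshold):
    kept = [flag for score, flag in zip(scores, flags) if score > threshold]
    return average_precision(kept, num_gt=3)

ap_low = ap_at_threshold(0.01)  # keeps all 5 detections
ap_high = ap_at_threshold(0.5)  # keeps only the first 3
print(round(ap_low, 4), round(ap_high, 4))  # 0.7556 0.5556
```

The high threshold cuts the curve off at recall 2/3, so the area contributed by the last true positive is simply lost.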

In [None]:
# plot precision/recall curves for two thresholds side by side
def plot_precision_recall_curve(precision1, recall1, precision2, recall2, thre1, thre2):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,4))

    ax1.plot(recall1, precision1)
    ax1.step(recall1, precision1, color='b', alpha=0.2, where='post')
    ax1.fill_between(recall1, precision1, alpha=0.2, color='b')
    ax1.set_title(f"threshold {thre1}")
    ax1.set(xlabel='recall', ylabel='precision')
    ax1.set_xlim([0.0, 1.0])
    ax1.set_ylim([0.0, 1.05])
    #ax1.invert_xaxis()
    
    ax2.plot(recall2, precision2)
    ax2.step(recall2, precision2, color='b', alpha=0.2, where='post')
    ax2.fill_between(recall2, precision2, alpha=0.2, color='b')
    ax2.set_title(f"threshold {thre2}")
    ax2.set(xlabel='recall', ylabel='precision')
    ax2.set_xlim([0.0, 1.0])
    ax2.set_ylim([0.0, 1.05])
    
    fig.tight_layout()
    plt.show()

In [None]:
for i in range(15):
    idx = str(i)
    average_precision001, _, precision001, recall001 = average_precisions_thre001[idx]
    average_precision05, _, precision05, recall05 = average_precisions_thre05[idx]
    print(f'class name:{class_name[i]}, AP with threshold0.01:{average_precision001}, AP with threshold0.5:{average_precision05}')
    plot_precision_recall_curve(precision001, recall001, precision05, recall05, 0.01, 0.5)

So we should use as low a confidence threshold as possible.
The only disadvantage is that lowering the threshold increases the number of predictions, so inference time increases.

From the left graph (threshold 0.01), we can see that there is not much room left for improvement if we lower the threshold even further.
Therefore, even though a smaller threshold gives a better mAP, the improvement is limited.
The maximum mAP score depends on the shape of the precision/recall curve.

# Class-dependence-is-linear
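We can expect the class dependency of mAP to be linear: AP is computed for each class independently, and the number of classes we divide by depends only on the annotations, so with the same annotations the following holds.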
<pre>
mAP(all class) = mAP(class_id == n) + mAP(class_id != n)
</pre>
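
The arithmetic behind this is short enough to check by hand. The sketch below uses made-up per-class AP values and a hypothetical `mean_ap` helper; it assumes, as in the code above, that a class without predictions gets AP 0 and that the set of annotated classes stays the same.

```python
import math

def mean_ap(per_class_ap, classes):
    """Unweighted mean over all annotated classes; classes without predictions count as 0."""
    return sum(per_class_ap.get(c, 0.0) for c in classes) / len(classes)

classes = ["0", "1", "2"]                     # classes present in the annotations
ap_all = {"0": 0.6, "1": 0.3, "2": 0.9}       # hypothetical AP per class
ap_only_class0 = {"0": ap_all["0"]}           # predictions filtered to class_id == 0
ap_without_class0 = {"1": ap_all["1"], "2": ap_all["2"]}  # class_id != 0

map_all = mean_ap(ap_all, classes)
map_sum = mean_ap(ap_only_class0, classes) + mean_ap(ap_without_class0, classes)
print(map_all, map_sum)  # both are 0.6 up to floating point rounding
```

Because the denominator is the same in all three calls, the two partial sums add up to the full mean.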


In [None]:
df_pred_all = df_pred
df_pred_class0 = df_pred[df_pred.class_id == 0]
df_pred_not_class0 = df_pred[df_pred.class_id != 0]

pred_all = df_pred_all[['image_id', 'class_id', 'conf','x_min','x_max','y_min','y_max']].values
pred_class0 = df_pred_class0[['image_id', 'class_id', 'conf','x_min','x_max','y_min','y_max']].values
pred_not_class0 = df_pred_not_class0[['image_id', 'class_id', 'conf','x_min','x_max','y_min','y_max']].values
anno = df_anno[['image_id', 'class_id','x_min','x_max','y_min','y_max']].values

mean_ap_all, _ = mean_average_precision_for_boxes(anno, pred_all, verbose=False)
mean_ap_class0, _ = mean_average_precision_for_boxes(anno, pred_class0, verbose=False)
mean_ap_not_class0, _ = mean_average_precision_for_boxes(anno, pred_not_class0, verbose=False)
print('all', mean_ap_all)
print('sum', mean_ap_class0 + mean_ap_not_class0)

This is obvious from the definition of mAP.
So you can check the score of each class individually, not only on CV but also on LB, by submitting predictions for a single class.
For competitions where it is difficult to build a local CV, such as [Human Protein Atlas - Single Cell Classification](https://www.kaggle.com/c/hpa-single-cell-image-classification), checking each class on LB sometimes helps.

# Small-sample-class-is-important

A class with few samples is just as important as a class with many samples, because:
<pre>
max(mAP(a small class)) == max(mAP(a big class))
</pre>

Furthermore, big classes are generally easier and have less room for improvement, while small classes are generally more difficult and have more room for improvement.
So we need to pay attention to small classes.

If mAP were a "weighted" mean of the average precisions, this would not be true.
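
A quick numerical comparison shows the difference. This is a minimal sketch with made-up AP values and sample counts; `macro_map` and `weighted_map` are hypothetical helpers, and weighting by sample count is only one possible weighting scheme.

```python
import math

def macro_map(ap):
    """Unweighted mean of per-class AP, as in the code in this notebook."""
    return sum(ap.values()) / len(ap)

def weighted_map(ap, counts):
    """Mean of per-class AP weighted by the number of samples (an assumed scheme)."""
    total = sum(counts.values())
    return sum(ap[c] * counts[c] for c in ap) / total

ap = {"big": 0.8, "small": 0.2}      # hypothetical AP per class
counts = {"big": 900, "small": 100}  # hypothetical number of GT boxes

improve_small = {"big": 0.8, "small": 0.3}  # +0.1 AP on the small class
improve_big = {"big": 0.9, "small": 0.2}    # +0.1 AP on the big class

macro_gain_small = macro_map(improve_small) - macro_map(ap)
macro_gain_big = macro_map(improve_big) - macro_map(ap)
weighted_gain_small = weighted_map(improve_small, counts) - weighted_map(ap, counts)
weighted_gain_big = weighted_map(improve_big, counts) - weighted_map(ap, counts)

print(round(macro_gain_small, 3), round(macro_gain_big, 3))        # 0.05 0.05
print(round(weighted_gain_small, 3), round(weighted_gain_big, 3))  # 0.01 0.09
```

With the unweighted mean, the same AP gain is worth the same amount regardless of class size, and gaining 0.1 AP is usually easier on a class that starts at 0.2.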

Normally, this weight information is not disclosed.
However, it is possible to guess the weights from the second point (class dependence is linear) together with the public leaderboard score.

I hope this information is useful, especially for beginners.

For an explanation of mAP itself, please see the following.

[github repo by ZFTurbo](https://github.com/ZFTurbo/Mean-Average-Precision-for-Boxes)

[mAP (mean Average Precision) for Object Detection](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173).

[Breaking Down Mean Average Precision (mAP)](https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52).

[Explanation of scoring metric (mAP@0.4)](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/212287) by @pestipeti.
