
mAP Computation vs Pycocotools #7

Closed
xyutao opened this issue Sep 6, 2018 · 20 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

@xyutao commented Sep 6, 2018

The mAP computation code is similar to https://github.com/eriklindernoren/PyTorch-YOLOv3/blob/959e0ff43f5b82bdacef87f4240bae8415eac45b/test.py#L69

It is incorrect to average an AP computed per sample, because AP is defined per class. The right way is to rank all detected instances of each object class across the whole test set, compute AP for each class, and then average the per-class APs.
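
For illustration, a minimal sketch of the procedure described above (the function, argument names, and data layout here are hypothetical, not taken from the repo; compute_ap stands for any routine that integrates a precision-recall curve):

import numpy as np

def mean_average_precision(per_class_detections, num_gt_per_class, compute_ap):
    # per_class_detections: dict class_id -> list of (confidence, is_true_positive)
    #                       gathered over the WHOLE test set
    # num_gt_per_class:     dict class_id -> number of ground-truth boxes of that class
    # compute_ap:           function (recall, precision) -> scalar AP
    aps = []
    for c, dets in per_class_detections.items():
        dets = sorted(dets, key=lambda d: -d[0])  # rank by confidence, highest first
        tp = np.array([d[1] for d in dets], dtype=float)
        fp = 1.0 - tp
        recall = np.cumsum(tp) / max(num_gt_per_class.get(c, 0), 1e-16)
        precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-16)
        aps.append(compute_ap(recall, precision))  # one AP per class
    return float(np.mean(aps))                     # mAP = mean of per-class APs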

@glenn-jocher

Thanks for the feedback. I opened Issue #5 about this earlier. Currently test.py generates only one precision-recall curve per image, whereas, as you say, I believe we want one per class in each image, and the average of those APs would then be the mAP for that image.

I can try to make this correction myself, or we could use an off-the-shelf solution, though that would require more imports. I was studying this page to learn more: http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

@glenn-jocher

After reviewing more examples, I think I can copy the methods in this repo:
https://github.com/rafaelpadilla/Object-Detection-Metrics

There is another key difference: mAP should be calculated across all images at once, rather than once per image as it is now. So I'm going to try to fully replace the mAP code with a new version that accumulates TP and FP vectors for each class and then produces 80 precision-recall curves covering all the objects in the 5000 validation images at once.

@glenn-jocher added the bug (Something isn't working) label on Sep 9, 2018
@glenn-jocher self-assigned this on Sep 9, 2018
@xyutao commented Sep 10, 2018

Yeah, the evaluation code in that repo is correct. Looking forward to your updates!

@glenn-jocher commented Sep 10, 2018

It looks like the original code for AP from recall-precision is fine:

def compute_ap(recall, precision):

So I left it alone and created a new function to call it once per class in commit c43be7b:

import numpy as np

def ap_per_class(tp, conf, pred_cls, target_cls):

    # Sort by decreasing confidence, so cumulative TP/FP follow the ranked detections
    i = np.argsort(-conf)
    tp, conf, pred_cls = tp[i], conf[i], pred_cls[i]

    # Find unique classes
    unique_classes = np.unique(np.concatenate((pred_cls, target_cls), 0))

    # Create Precision-Recall curve and compute AP for each class
    ap = []
    for c in unique_classes:
        i = pred_cls == c
        n_gt = sum(target_cls == c)  # Number of ground truth objects

        if sum(i) == 0:
            ap.append(0)
        else:
            # Accumulate FPs and TPs
            fpa = np.cumsum(1 - tp[i])
            tpa = np.cumsum(tp[i])

            # Recall
            recall = tpa / (n_gt + 1e-16)

            # Precision
            precision = tpa / (tpa + fpa)

            # AP from recall-precision curve
            ap.append(compute_ap(recall, precision))

    # Return one AP per class
    return np.array(ap)

However, when I re-evaluate with this method, mAP drops from 58.1 to 56.7. Darknet reports 57.9. I currently combine true and predicted classes into the list of classes evaluated per image; perhaps I should only be using one or the other. I will have to experiment some more.
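
For reference, compute_ap functions of this kind typically follow the py-faster-rcnn voc_ap approach (area under the interpolated precision-recall curve); here is a minimal sketch, which may differ in detail from the repo's actual implementation:

import numpy as np

def compute_ap(recall, precision):
    # recall, precision: arrays for one class, ordered by descending confidence
    # Append sentinel values so the curve spans recall 0..1
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))

    # Make precision monotonically decreasing (precision envelope)
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

    # Integrate: sum delta-recall * precision where recall changes
    i = np.where(mrec[1:] != mrec[:-1])[0]
    return np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])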

@xyutao commented Sep 11, 2018

Maybe you should perform per-class rank ordering instead of per-image rank ordering.

Taking VOC as an example, the evaluation code first produces a per-class prediction list over the whole test set in the format (image_id, score, x0, y0, x1, y1), like:

0000.jpg 0.98 100 100 200 200 # the 1st instance of image 0000.jpg
0000.jpg 0.51 10 10 1000 1000 # the last instance of image 0000.jpg
0001.jpg 0.78 100 100 200 200 # the 1st instance of image 0001.jpg
0001.jpg 0.05 10 10 1000 1000 # the last instance of image 0001.jpg
...

then performs rank ordering over all instances:
image-id score
0000.jpg 0.98
0001.jpg 0.78
0000.jpg 0.51
...
0001.jpg 0.05

then computes and accumulates TPs and FPs ... This way, the mAP should be higher than with per-image rank ordering (no wonder the authors of YOLO said mAP is screwed up xD)
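
To make the ranking step concrete, here is a rough sketch of per-class rank ordering with TP/FP accumulation over the whole test set (the function names, data layout, and the 0.5 IoU threshold are illustrative, not taken from any particular repo):

import numpy as np

def iou(box1, box2):
    # Intersection-over-union of two boxes in (x0, y0, x1, y1) format
    ix0, iy0 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix1, iy1 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter + 1e-16)

def cumulative_tp_fp(detections, gt_boxes, iou_thres=0.5):
    # detections: list of (image_id, score, box) for ONE class over the whole test set
    # gt_boxes:   dict image_id -> list of ground-truth boxes of that class
    # Rank ordering is per class across ALL images, not per image
    detections = sorted(detections, key=lambda d: -d[1])
    matched = {img: [False] * len(boxes) for img, boxes in gt_boxes.items()}
    tp = np.zeros(len(detections))
    for k, (img, _, box) in enumerate(detections):
        candidates = gt_boxes.get(img, [])
        ious = [iou(box, g) for g in candidates]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thres and not matched[img][best]:
            tp[k] = 1  # highest-scoring match to an unused GT box is a TP
            matched[img][best] = True
    fp = 1 - tp
    return np.cumsum(tp), np.cumsum(fp)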

@glenn-jocher added the help wanted (Extra attention is needed) label on Sep 21, 2018
@glenn-jocher

@xyutao I'm updating the mAP code, both to correct the repo mAP calculation and to output a COCO-format JSON file so that the cocoapi can compute the official mAP.

Earlier you recommended switching from per-image rank ordering to per-class rank ordering. Is this still your recommendation? Do you know if this is how COCO computes mAP?

@okanlv commented Feb 25, 2019

@glenn-jocher

@xyutao @okanlv I just noticed that the pycocotools demo notebook selects a subset of the entire validation set, just as we want for yolov3, since darknet only validates on the 5000 images in 5k.txt. You can see in the notebook that pycocotools states it is running a per-image evaluation.

https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb
[screenshot: pycocotools demo notebook output]
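
For reference, the workflow that notebook demonstrates looks roughly like this (the file paths and the image-id subset below are placeholders):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

cocoGt = COCO('annotations/instances_val2014.json')  # placeholder ground-truth file
cocoDt = cocoGt.loadRes('results.json')              # placeholder detections file

cocoEval = COCOeval(cocoGt, cocoDt, 'bbox')
cocoEval.params.imgIds = sorted(cocoGt.getImgIds())[:5000]  # evaluate only a subset
cocoEval.evaluate()    # per-image, per-category matching
cocoEval.accumulate()  # merge matches across images, per category
cocoEval.summarize()   # print the standard AP/AR table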

@guagen commented Mar 14, 2019

@xyutao, have you figured this out? Under the traditional mAP calculation it should indeed work the way you describe: detect all candidate objects across all images, sort them by confidence, take them one by one as positives while computing recall and precision, and finally compute the AP for each class. However, after reading this repo author's replies and going through the COCO API code in detail, I started to wonder whether computing the AP of each image and then averaging might actually be how COCO computes mAP. I haven't yet found any official documentation that precisely describes COCO's mAP computation; at least every Chinese-language description of detection mAP computation I could find uses the traditional method.

@xyutao commented Mar 14, 2019

@guagen The COCO API also uses the traditional method. It first calls the evaluateImg function to compute, for each class of a single image, how the detection boxes match the ground truth; it then calls the accumulate function to merge the match results of the same class across all images, sorts the detection boxes by score in descending order, and finally computes precision-recall over the merged results.

For the return value of evaluateImg, see: https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L302

Code that merges the detection-box scores of all images for a single class:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L363

Code that sorts the detection-box scores in descending order:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L367

@xyutao commented Mar 14, 2019

@glenn-jocher The per-image evaluation just matches the detections and ground truth for each category, as shown in the evaluateImg function:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L236

Recall and precision are computed per category, by accumulating the matches and ranking the detection scores across all images of that category. See:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L363

@guagen commented Mar 14, 2019

@xyutao I see. What a headache; it looks like the author of this repo got the code wrong, and with his approach the resulting mAP does not equal the sum of the per-class APs divided by the number of classes.

@xyutao commented Mar 14, 2019

@glenn-jocher Here is the key code from the COCO API for accumulating detections:
(ref: https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L354)

In this fragment, k_list stores the category ids, a_list the area ranges, m_list the max detections, and i_list the image ids. The outer for-loop iterates per category, while the inner loop accumulates the matching results of each image:
E = [self.evalImgs[Nk + Na + i] for i in i_list]
then merges the scores of all detections:
dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])
and ranks them in descending order:
inds = np.argsort(-dtScores, kind='mergesort')
The matching results pre-computed in evaluateImg are merged:
dtm = np.concatenate([e['dtMatches'][:,0:maxDet] for e in E], axis=1)[:,inds]
and tps and fps are derived from the merged matches:
tps = np.logical_and( dtm, np.logical_not(dtIg) )
fps = np.logical_and(np.logical_not(dtm), np.logical_not(dtIg) )
...
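
Following on from the snippet above: once the per-category tps and fps are merged across images in score order, precision and recall come directly from cumulative sums. A simplified sketch (not the exact pycocotools code):

import numpy as np

def pr_from_matches(tps, fps, num_gt):
    # tps, fps: boolean arrays over the score-ranked detections of ONE category,
    #           already merged across all images
    # num_gt:   number of (non-ignored) ground-truth boxes of that category
    tp_sum = np.cumsum(tps).astype(float)
    fp_sum = np.cumsum(fps).astype(float)
    recall = tp_sum / max(num_gt, 1)
    precision = tp_sum / np.maximum(tp_sum + fp_sum, np.spacing(1))
    return precision, recall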

@glenn-jocher

I think it should be a fairly straightforward change to test.py to calculate mAP averaged per class rather than per image. I can try to implement this in the next few days. Luckily, the new --save-json argument in test.py now outputs the official COCO mAP, which will make comparing the updates easier.

We probably want to lower the default --conf-thresh from 0.3 to something like 0.001 as well.

@guagen commented Mar 15, 2019

@glenn-jocher
I can't wait to see these changes. They will make this code even better. :)
Anyway, thanks for your excellent work.

@glenn-jocher commented Mar 26, 2019

All, I created a new https://github.com/ultralytics/yolov3/tree/map_update branch to test mAP updates. I converted from image-averaging to class-averaging. The result is 0.519 mAP now vs 0.550 pycocotools mAP using official yolov3.weights. I'm not sure where the discrepancy lies.

This mainly affects custom data; for COCO data we can simply use the --save-json flag in test.py, which produces a 0.550 mAP, essentially identical to the reported 0.553 mAP. One option for custom data would be to create both predictions and annotations JSON files and then run pycocotools on the two. I tried this, but ran into problems producing the annotations file; it seems more complicated than the JSON produced by --save-json (a minimal example of the expected format is sketched below).
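
For anyone attempting the custom-data route, the annotations file pycocotools expects is the standard COCO detection ground-truth format. A minimal hand-built example (all field values are illustrative):

import json

annotations = {
    "images": [
        {"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "person"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in absolute pixels
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 120.0, 50.0, 80.0], "area": 4000.0, "iscrowd": 0},
    ],
}

with open("custom_gt.json", "w") as f:
    json.dump(annotations, f)

# detections go in a separate flat list, one dict per predicted box:
# [{"image_id": 1, "category_id": 1, "bbox": [x, y, w, h], "score": 0.9}, ...]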

Another item is that the previous mAP calculation could operate at a reasonable conf_thres of 0.1-0.5. The new mAP calculation requires an extremely low conf_thres for best results (as is the case with pycocotools mAP as well), around conf_thres=0.001, which takes longer to compute (from 1 min per epoch to 6 minutes on a V100). This produces an extreme number of False Positives (FPs), all requiring NMS (which is currently on CPU). I estimate that for each true detection there are about 15 false positives. It's beyond me how this garbage dump of FPs has been prioritized by the COCO organizers as their metric of choice, but it seems to be the only number anyone cares about.

rm -rf yolov3 && git clone -b map_update --depth 1 https://github.com/ultralytics/yolov3 yolov3
python3 test.py --conf-thres 0.001 --save-json
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')

Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)

      Image      Total          P          R        mAP
100%|████████████████████████████████████████████████████████████████████████████| 157/157 [06:34<00:00,  1.93s/it]
       5000       5000     0.0865      0.727      0.519

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.550
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.142
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.336
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.455
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.267
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.408
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590

@glenn-jocher changed the title from "mAP computation is incorrect" to "mAP Computation vs Pycocotools" on Mar 26, 2019
@glenn-jocher commented Mar 26, 2019

UPDATE: difference narrowed down to 0.531 (repo calculation) vs 0.551 (pycocotools). The obj_conf used affects the mAP: whether it is multiplied by class_conf, and if so whether that class_conf is produced by sigmoid or softmax.

rm -rf yolov3 && git clone -b map_update --depth 1 https://github.com/ultralytics/yolov3 yolov3
python3 test.py --conf-thres 0.001 --save-json
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')

Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [07:00<00:00,  2.09s/it]
       5000       5000     0.0865      0.727      0.531

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.551
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.143
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.455
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.267
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.407
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590

@glenn-jocher

We've made a great effort to align the repo mAP with the COCO mAP. It's not perfect, but the current result seems to track steadily about 2% lower than the COCO mAP as the epochs increase. We are going to stop updating the repo mAP code now, merge the development branch into master, and release v4.0.

This plot shows v4.0 training with the repo mAP (blue) overlaid with the pycocotools mAP (orange), using --conf_thres 0.1 and default training settings. We decided to use --conf_thres 0.1 to increase test speed during training, while leaving the test.py default at --conf_thres 0.001 for the highest mAP when run by hand later on.
[plot: repo mAP (blue) vs pycocotools mAP (orange) during v4.0 training]

@glenn-jocher

UPDATE: difference narrowed down to 0.550 (repo calculation) vs 0.549 (pycocotools).

python3 test.py
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=False, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
Calculating mAP: 100%|█████████████████████████████████| 157/157 [06:05<00:00,  1.82s/it]
       5000       5000       0.11      0.746       0.55
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.142
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.335
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.454
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.266
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.406
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.429
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.236
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.466
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.586

Results on first part of Single Class Tutorial (in Wiki). Blue is repo mAP, orange is pycocotools mAP.
[plot: repo mAP (blue) vs pycocotools mAP (orange) on the Single Class Tutorial]

@glenn-jocher

Final results are in, and PR #176 is complete. The repo mAP now aligns with the COCO mAP to within 1% under most circumstances, and the mAP output now exceeds the published darknet YOLOv3 results. I will close the issue unless there are any other questions.

Model        ultralytics/yolov3 (with pycocotools)   darknet/yolov3
YOLOv3-320   51.8                                    51.5
YOLOv3-416   55.4                                    55.3
YOLOv3-608   58.2                                    57.9
sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3
# bash yolov3/data/get_coco_dataset.sh
sudo rm -rf cocoapi && git clone https://github.com/cocodataset/cocoapi && cd cocoapi/PythonAPI && make && cd ../.. && cp -r cocoapi/PythonAPI/pycocotools yolov3
cd yolov3

python3 test.py --save-json --conf-thres 0.001 --img-size 416
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
Calculating mAP: 100%|█████████████████████████████████| 157/157 [08:34<00:00,  2.53s/it]
       5000       5000     0.0896      0.756      0.555
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.312
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.317
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.145
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.343
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.452
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.411
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.244
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.477
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.587
 
python3 test.py --save-json --conf-thres 0.001 --img-size 608 --batch-size 16
Namespace(batch_size=16, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=608, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
Calculating mAP: 100%|█████████████████████████████████| 313/313 [08:54<00:00,  1.55s/it]
       5000       5000     0.0966      0.786      0.579
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.331
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.582
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.344
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.198
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.362
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.427
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.281
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.437
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.463
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.309
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.494
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.577
