
Performance metrics per class for comparison #5880

Closed
KristofferK opened this issue Dec 4, 2021 · 17 comments
Labels
question Further information is requested

Comments

@KristofferK

Search before asking

Question

Hello,

Is it possible to easily compare the performance on different classes in a multiclass model? I've trained it on a custom dataset with 4 classes. Let's say Class A, B, C, and D.

How would I see e.g. the precision, recall and mAP per class? This is to see if my model is better at a certain class or if it is equally good amongst them all.

I suspect that my model is very good on class A, good on B and C, whole it is mediocre at class D. This is what it looks like when I manually inspect the annotations on the test results. But I'd like the actual performance metrics to support my hypothesis.

Thanks in advance.

Additional

No response

KristofferK added the question (Further information is requested) label on Dec 4, 2021
@glenn-jocher
Member

@KristofferK metrics are automatically displayed per class. No action is required.

[Screenshot: 2021-12-04 at 13:08:47]
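
For reference, the per-class table can also be reproduced after training by running val.py directly; a minimal sketch (paths are placeholders, and --verbose requests the per-class breakdown):

# print per-class P, R and mAP for a trained model (paths are placeholders)
python val.py --data path/to/data_config.yaml --weights runs/train/exp/weights/best.pt --verbose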

@KristofferK
Author

KristofferK commented Dec 4, 2021

Thank you for the very quick response, Glenn!

~~When I use val.py, I do not seem to be able to use "--task test". Do you have any idea how to fix this? It works with "--task val", "--task study", etc. I am on the latest commit.~~ (Fixed: I did not have test in data_config.yaml.)

I still have a problem with the metrics though, as you will see further down.

python val.py --data C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml --weights C:/Users/Fobbe/Documents/GitHub/yolov5/runs/train/exp52/weights/best.pt --img 2560 --task test

val: data=C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml, weights=['C:/Users/Fobbe/Documents/GitHub/yolov5/runs/train/exp52/weights/best.pt'], batch_size=32, imgsz=2560, conf_thres=0.001, iou_thres=0.6, task=test, device=, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=False, project=runs\val, name=exp, exist_ok=False, half=False
YOLOv5  v6.0-38-gc0c15d8 torch 1.8.2 CUDA:0 (NVIDIA GeForce RTX 3060, 12288.0MB)

Fusing layers...
Model Summary: 308 layers, 21049761 parameters, 0 gradients, 50.3 GFLOPs
Traceback (most recent call last):
  File "val.py", line 360, in <module>
    main(opt)
  File "val.py", line 334, in main
    run(**vars(opt))
  File "C:\Users\Fobbe\anaconda3\envs\yolo\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "val.py", line 152, in run
    dataloader = create_dataloader(data[task], imgsz, batch_size, gs, single_cls, pad=pad, rect=True,
KeyError: 'test'
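
For anyone hitting the same KeyError: it goes away once the dataset YAML actually defines a test split. A minimal sketch of data_config.yaml (paths and class names below are placeholders, not my actual config):

path: path/to/drosophila-split-yolov5   # dataset root (placeholder)
train: images/train
val: images/val
test: images/test                        # this key must exist for --task test
nc: 4
names: ['Class A', 'Class B', 'Class C', 'Class D']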

Also, even when I use "--task val", the precision (and the other metrics) from this run is far lower than what I saw in results.csv during training. How come?

With --task val, for the "all" row I get 0.35 P, 0.675 R, 0.411 mAP@.5.

In results.csv, for the last row (almost the best one; some previous rows are slightly better) I have 0.99754 P, 0.98379 R, 0.98431 mAP@.5.

With: --task val
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|█████████████████████████████████████████████████| 1/1 [00:09<00:00,  9.13s/it]
                 all         14        330       0.35      0.675      0.411      0.354
                Class A         14         21      0.287      0.905      0.282      0.245
                Class B         14        262      0.799      0.996      0.995       0.86
                Class C         14         34      0.286      0.647      0.285      0.242
                Class D         14         13     0.0287      0.154     0.0807     0.0686
Results.csv header and last row
               epoch,      train/box_loss,      train/obj_loss,      train/cls_loss,   metrics/precision,      metrics/recall,     metrics/mAP_0.5,metrics/mAP_0.5:0.95,        val/box_loss,        val/obj_loss,        val/cls_loss,               x/lr0,               x/lr1,               x/lr2
                 499,            0.016152,            0.061862,           0.0018148,             0.99754,             0.98379,             0.98431,             0.88187,            0.014116,            0.049493,           0.0011332,           0.0010004,           0.0010004,           0.0010004

This seems like a huge difference? I assumed the metrics in results.csv were for the validation set. Is this not the case? Is it for the training set? Wouldn't it be overfitting to use the training set here?
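
For a side-by-side comparison, a quick sketch for pulling those final-epoch numbers out of results.csv (using pandas; the run directory is a placeholder):

import pandas as pd

# results.csv column names are padded with spaces, so strip them first
df = pd.read_csv("runs/train/exp52/results.csv")
df.columns = df.columns.str.strip()

# the metrics/* columns hold the P, R and mAP values logged after each epoch
last = df.iloc[-1]
print(last[["metrics/precision", "metrics/recall", "metrics/mAP_0.5", "metrics/mAP_0.5:0.95"]])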

Thanks in advance

@glenn-jocher
Member

@KristofferK The --task argument allows you to specify any split; the default is val.

@KristofferK
Author

Thanks again

I solved the --task issue by adding "test" to my data_config.yaml. For the metrics, it seems to be a matter of toggling --conf-thres and --iou-thres: when I run test.py I get very good results, but val.py does not reflect these good results. I will experiment with it further. Thank you for the quick responses.

@KristofferK
Author

KristofferK commented Dec 4, 2021

I would however still like to know whether results.csv is for the training set or for the validation set. Would it not be overfitting to select a model based on training set results?

It's just that the metrics in my results.csv differ a lot from those I get with val.py.

Should I rather create a new issue for this?

@glenn-jocher
Member

@KristofferK there is no test.py.

Training always evaluates on the validation set. This is standard practice in any ML workflow. You can browse the code here:

yolov5/train.py

Lines 352 to 367 in 7bf04d9

# mAP
callbacks.run('on_train_epoch_end', epoch=epoch)
ema.update_attr(model, include=['yaml', 'nc', 'hyp', 'names', 'stride', 'class_weights'])
final_epoch = (epoch + 1 == epochs) or stopper.possible_stop
if not noval or final_epoch:  # Calculate mAP
    results, maps, _ = val.run(data_dict,
                               batch_size=batch_size // WORLD_SIZE * 2,
                               imgsz=imgsz,
                               model=ema.ema,
                               single_cls=single_cls,
                               dataloader=val_loader,
                               save_dir=save_dir,
                               plots=False,
                               callbacks=callbacks,
                               compute_loss=compute_loss)

@KristofferK
Author

KristofferK commented Dec 4, 2021

@glenn-jocher That is my mistake. I meant that when I run train.py I get very good metrics (in results.csv). When I use detect.py, I get good annotations. When I use val.py, I do not get good metrics - not even with --task val. I have included some results below. Hopefully it makes sense.

From results.csv:
Precision: 0.99754
Recall: 0.98379
mAP@.5: 0.98431

               epoch,      train/box_loss,      train/obj_loss,      train/cls_loss,   metrics/precision,      metrics/recall,     metrics/mAP_0.5,metrics/mAP_0.5:0.95,        val/box_loss,        val/obj_loss,        val/cls_loss,               x/lr0,               x/lr1,               x/lr2
               
                 499,            0.016152,            0.061862,           0.0018148,             0.99754,             0.98379,             0.98431,             0.88187,            0.014116,            0.049493,           0.0011332,           0.0010004,           0.0010004,           0.0010004

But when I use val.py, I do not get the same results. Not at all.

This is with manually set confidence:

python val.py --data C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml --weights C:/Users/Fobbe/Documents/GitHub/yolov5/runs/train/exp52/weights/best.pt --img 2560 --task val --conf-thres 0.25 --iou-thres 0.3

               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|█| 1/1 [00:06<00:00,  6.81
                 all         14        330      0.249      0.242      0.246      0.216
             Class A         14         21          0          0          0          0
             Class B         14        262      0.996      0.969      0.986      0.862
             Class C         14         34          0          0          0          0
             Class D         14         13          0          0          0          0

This is with default confidence flags:

python val.py --data C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml --weights C:/Users/Fobbe/Documents/GitHub/yolov5/runs/train/exp52/weights/best.pt --img 2560 --task val

               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|█| 1/1 [00:07<00:00,  7.19
                 all         14        330       0.35      0.675      0.411      0.354
             Class A         14         21      0.287      0.905      0.282      0.245
             Class B         14        262      0.799      0.996      0.995       0.86
             Class C         14         34      0.286      0.647      0.285      0.242
             Class D         14         13     0.0287      0.154     0.0807     0.0686

Am I missing a flag to make val.py give me the same good results I see in results.csv?

When I use detect.py, it does give me good annotations, and these annotations have high confidence (0.95+ in general), which is why I don't understand the low precision and recall when I use val.py.

In case it matters, this is how I trained the model:

python train.py --img 2560 --batch 2 --epochs 500 --data C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml --weights yolov5m.pt

@glenn-jocher
Member

@KristofferK I don't know what you're asking. Your metrics are your metrics, there's nothing for me to do here.

You obtain metrics on your dataset by running val.py with the same train.py settings, i.e.

python train.py --data path/to/data.yaml

python val.py --data path/to/data.yaml --weights path/to/best.pt

That's it.

@KristofferK
Author

But shouldn't the metrics from val.py match the metrics from train.py?

If train.py shows the metrics for the validation set and val.py shows the metrics for the validation set (--task val or no --task), then these should surely be the same, no?

@nqthai309

@KristofferK The metrics obtained during training after each epoch (P, R, mAP) are produced by the last.pt model at that point in time. They will differ when you run val.py with best.pt, because last.pt is not the same as best.pt.

@KristofferK
Author

@KristofferK The metrics obtained during training after each epoch (P, R, mAP) are produced by the last.pt model at that point in time. They will differ when you run val.py with best.pt, because last.pt is not the same as best.pt.

I believe that would make sense if the metrics I got from val.py were better. But when I use val.py with best.pt, the metrics are worse than those of last.pt during training.
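
A simple way to check this would be to validate both checkpoints with identical settings and compare the two tables; a sketch, with the paths shortened from the ones used earlier in the thread:

# same data, image size and thresholds; only the checkpoint differs
python val.py --data data_config.yaml --weights runs/train/exp52/weights/last.pt --img 2560 --task val
python val.py --data data_config.yaml --weights runs/train/exp52/weights/best.pt --img 2560 --task val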

@nqthai309

Maybe the conf-thres and iou-thres are different.
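
For context, val.py's defaults for these two flags are deliberately low, which is what the mAP computation expects; a sketch making them explicit (paths shortened):

# val.py defaults: --conf-thres 0.001 and --iou-thres 0.6 (NMS IoU)
# raising --conf-thres (e.g. to 0.25) will noticeably lower the reported recall and mAP
python val.py --data data_config.yaml --weights runs/train/exp52/weights/best.pt --img 2560 --conf-thres 0.001 --iou-thres 0.6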

@glenn-jocher
Member

glenn-jocher commented Dec 8, 2021

@KristofferK 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible to produce the problem
  • Complete – Provide all parts someone else needs to reproduce the problem
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance your code should also be:

  • Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
  • Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

@KristofferK
Author

It seems to be because train.py uses --single-cls for the metrics (despite single_cls being false in opt.yaml), while val.py does not use --single-cls by default.

As such, if I add --single-cls to my val.py arguments, it produces the same results as I had in train.py. But I did not use --single-cls for train.py.

But will this not result in suboptimally trained models, if these models are trained as single-class classifiers? Or does it only use single-cls when printing to the console and results.csv?

@glenn-jocher
Member

@KristofferK --single-cls is an optional argument that allows you to train any dataset in single-class mode, i.e. all 80 classes in COCO will be treated as one class.
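
In other words, the flag only changes how the labels are grouped; a sketch of the two ways to run validation (paths shortened):

# multi-class evaluation (default): per-class P, R and mAP are reported
python val.py --data data_config.yaml --weights best.pt --img 2560
# single-class evaluation: all labels are collapsed into one class before computing metrics
python val.py --data data_config.yaml --weights best.pt --img 2560 --single-cls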

@KristofferK
Author

@glenn-jocher But even when not specifying --single-cls during training (train.py), the printed metrics (in the terminal or results.csv) seem to be the metrics for a single-class classifier, i.e. the metrics it prints ignore whether the class is correctly predicted.

This must be the case, since when I run val.py with --single-cls I get the same metrics as I get from train.py without --single-cls.
But if I run val.py without --single-cls, I do not get the same metrics as train.py.

So the metrics printed during training are metrics for a single-class classifier, even though we are training a multi-class model?

@glenn-jocher
Member

@KristofferK I have no idea what you are asking. If you are not trying to force a dataset into single-class mode, there is no reason to use this argument.

Train and val operate correctly. If you have a reproducible bug, submit a bug report with code to reproduce it.
