
Performance metrics per class for comparison #5880

Closed
KristofferK opened this issue Dec 4, 2021 · 17 comments
Labels
question Further information is requested

Comments

@KristofferK

Search before asking

Question

Hello,

Is it possible to easily compare the performance on different classes in a multiclass model? I've trained it on a custom dataset with 4 classes. Let's say Class A, B, C, and D.

How would I see e.g. the precision, recall and mAP per class? This is to see if my model is better at a certain class or if it is equally good amongst them all.

I suspect that my model is very good on class A, good on B and C, whole it is mediocre at class D. This is what it looks like when I manually inspect the annotations on the test results. But I'd like the actual performance metrics to support my hypothesis.

Thanks in advance.

Additional

No response

KristofferK added the question (Further information is requested) label on Dec 4, 2021
@glenn-jocher
Member

@KristofferK metrics are automatically displayed per class. No action is required.

[Screenshot: 2021-12-04 at 13:08:47]
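
For reference, the per-class table can also be reproduced after training by running val.py directly; a minimal sketch (paths are placeholders, and --verbose requests the per-class breakdown):

# print per-class P, R and mAP for a trained model (paths are placeholders)
python val.py --data path/to/data_config.yaml --weights runs/train/exp/weights/best.pt --verbose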

@KristofferK
Author

KristofferK commented Dec 4, 2021

Thank you for the very quick response, Glenn!

~~When I use val.py, I do not seem to be able to use "--task test". Do you have any idea how to fix this? It works with "--task val", "--task study", etc. I am on the latest commit.~~ (Fixed: I did not have test in data_config.yaml.)

I still have a problem with the metrics though, as you will see further down.

python val.py --data C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml --weights C:/Users/Fobbe/Documents/GitHub/yolov5/runs/train/exp52/weights/best.pt --img 2560 --task test

val: data=C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml, weights=['C:/Users/Fobbe/Documents/GitHub/yolov5/runs/train/exp52/weights/best.pt'], batch_size=32, imgsz=2560, conf_thres=0.001, iou_thres=0.6, task=test, device=, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=False, project=runs\val, name=exp, exist_ok=False, half=False
YOLOv5  v6.0-38-gc0c15d8 torch 1.8.2 CUDA:0 (NVIDIA GeForce RTX 3060, 12288.0MB)

Fusing layers...
Model Summary: 308 layers, 21049761 parameters, 0 gradients, 50.3 GFLOPs
Traceback (most recent call last):
  File "val.py", line 360, in <module>
    main(opt)
  File "val.py", line 334, in main
    run(**vars(opt))
  File "C:\Users\Fobbe\anaconda3\envs\yolo\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "val.py", line 152, in run
    dataloader = create_dataloader(data[task], imgsz, batch_size, gs, single_cls, pad=pad, rect=True,
KeyError: 'test'
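
For anyone hitting the same KeyError: it goes away once the dataset YAML actually defines a test split. A minimal sketch of data_config.yaml (paths and class names below are placeholders, not my actual config):

path: path/to/drosophila-split-yolov5   # dataset root (placeholder)
train: images/train
val: images/val
test: images/test                        # this key must exist for --task test
nc: 4
names: ['Class A', 'Class B', 'Class C', 'Class D']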

Also, even when I use "--task val", the precision (and the other metrics) from this run is far lower than what I saw in results.csv during training. How come?

With --task val, for the "all" row I get 0.35 P, 0.675 R, 0.411 mAP@.5.

In results.csv, for the last row (almost the best one; some previous rows are slightly better) I have 0.99754 P, 0.98379 R, 0.98431 mAP@.5.

With: --task val
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|█████████████████████████████████████████████████| 1/1 [00:09<00:00,  9.13s/it]
                 all         14        330       0.35      0.675      0.411      0.354
                Class A         14         21      0.287      0.905      0.282      0.245
                Class B         14        262      0.799      0.996      0.995       0.86
                Class C         14         34      0.286      0.647      0.285      0.242
                Class D         14         13     0.0287      0.154     0.0807     0.0686
Results.csv header and last row
               epoch,      train/box_loss,      train/obj_loss,      train/cls_loss,   metrics/precision,      metrics/recall,     metrics/mAP_0.5,metrics/mAP_0.5:0.95,        val/box_loss,        val/obj_loss,        val/cls_loss,               x/lr0,               x/lr1,               x/lr2
                 499,            0.016152,            0.061862,           0.0018148,             0.99754,             0.98379,             0.98431,             0.88187,            0.014116,            0.049493,           0.0011332,           0.0010004,           0.0010004,           0.0010004

This seems like a huge difference? I assumed the metrics in results.csv were for the validation set. Is this not the case? Is it for the training set? Wouldn't it be overfitting to use the training set here?
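
For a side-by-side comparison, a quick sketch for pulling those final-epoch numbers out of results.csv (using pandas; the run directory is a placeholder):

import pandas as pd

# results.csv column names are padded with spaces, so strip them first
df = pd.read_csv("runs/train/exp52/results.csv")
df.columns = df.columns.str.strip()

# the metrics/* columns hold the P, R and mAP values logged after each epoch
last = df.iloc[-1]
print(last[["metrics/precision", "metrics/recall", "metrics/mAP_0.5", "metrics/mAP_0.5:0.95"]])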

Thanks in advance

@glenn-jocher
Member

@KristofferK The --task argument allows you to specify any split; the default is val.

@KristofferK
Author

Thanks again

I solved the --task issue by adding "test" to my data_config.yaml. For the metrics, it seems to be a matter of toggling --conf-thres and --iou-thres: when I run test.py I get very good results, but val.py does not reflect these good results. I will experiment with it further. Thank you for the quick responses.

@KristofferK
Author

KristofferK commented Dec 4, 2021

I would however still like to know whether results.csv is for the training set or for the validation set. Would it not be overfitting to select a model based on training set results?

It's just that the metrics in my results.csv differ a lot from those I get with val.py.

Should I rather create a new issue for this?

@glenn-jocher
Member

@KristofferK there is no test.py.

Training always evaluates on the validation set. This is standard practice in any ML workflow. You can browse the code here:

yolov5/train.py

Lines 352 to 367 in 7bf04d9

# mAP
callbacks.run('on_train_epoch_end', epoch=epoch)
ema.update_attr(model, include=['yaml', 'nc', 'hyp', 'names', 'stride', 'class_weights'])
final_epoch = (epoch + 1 == epochs) or stopper.possible_stop
if not noval or final_epoch:  # Calculate mAP
    results, maps, _ = val.run(data_dict,
                               batch_size=batch_size // WORLD_SIZE * 2,
                               imgsz=imgsz,
                               model=ema.ema,
                               single_cls=single_cls,
                               dataloader=val_loader,
                               save_dir=save_dir,
                               plots=False,
                               callbacks=callbacks,
                               compute_loss=compute_loss)

@KristofferK
Author

KristofferK commented Dec 4, 2021

@glenn-jocher That is my mistake. I meant that when I run train.py I get very good metrics (in results.csv). When I use detect.py, I get good annotations. When I use val.py, I do not get good metrics - not even with --task val. I have included some results below. Hopefully it makes sense.

From results.csv:
Precision: 0.99754
Recall: 0.98379
mAP@.5: 0.98431

               epoch,      train/box_loss,      train/obj_loss,      train/cls_loss,   metrics/precision,      metrics/recall,     metrics/mAP_0.5,metrics/mAP_0.5:0.95,        val/box_loss,        val/obj_loss,        val/cls_loss,               x/lr0,               x/lr1,               x/lr2
               
                 499,            0.016152,            0.061862,           0.0018148,             0.99754,             0.98379,             0.98431,             0.88187,            0.014116,            0.049493,           0.0011332,           0.0010004,           0.0010004,           0.0010004

But when I use val.py, I do not get the same results. Not at all.

This is with manually set confidence:

python val.py --data C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml --weights C:/Users/Fobbe/Documents/GitHub/yolov5/runs/train/exp52/weights/best.pt --img 2560 --task val --conf-thres 0.25 --iou-thres 0.3

               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|█| 1/1 [00:06<00:00,  6.81
                 all         14        330      0.249      0.242      0.246      0.216
             Class A         14         21          0          0          0          0
             Class B         14        262      0.996      0.969      0.986      0.862
             Class C         14         34          0          0          0          0
             Class D         14         13          0          0          0          0

This is with default confidence flags:

python val.py --data C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml --weights C:/Users/Fobbe/Documents/GitHub/yolov5/runs/train/exp52/weights/best.pt --img 2560 --task val

               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|█| 1/1 [00:07<00:00,  7.19
                 all         14        330       0.35      0.675      0.411      0.354
             Class A         14         21      0.287      0.905      0.282      0.245
             Class B         14        262      0.799      0.996      0.995       0.86
             Class C         14         34      0.286      0.647      0.285      0.242
             Class D         14         13     0.0287      0.154     0.0807     0.0686

Am I missing a flag to make val.py give me the same good results I see in results.csv?

When I use detect.py, it does give me good annotations, and these annotations have high confidence (0.95+ in general), which is why I don't understand the low precision and recall when I use val.py.

In case it matters, this is how I trained the model:

python train.py --img 2560 --batch 2 --epochs 500 --data C:/Users/Fobbe/Desktop/drosophila-split-yolov5/data_config.yaml --weights yolov5m.pt

@glenn-jocher
Member

@KristofferK I don't know what you're asking. Your metrics are your metrics, there's nothing for me to do here.

You obtain metrics on your dataset by running val.py with the same train.py settings, i.e.

python train.py --data path/to/data.yaml

python val.py --data path/to/data.yaml --weights path/to/best.pt

That's it.

@KristofferK
Author

But shouldn't the metrics from val.py match the metrics from train.py?

If train.py shows the metrics for the validation set and val.py shows the metrics for the validation set (--task val or no --task), then these should surely be the same, no?

@nqthai309

@KristofferK The metrics obtained during training after each epoch (P, R, mAP) are produced by the last.pt model at that point in time. They will differ when you run val.py with best.pt, because last.pt is not the same as best.pt.

@KristofferK
Author

@KristofferK The metrics obtained during training after each epoch (P, R, mAP) are produced by the last.pt model at that point in time. They will differ when you run val.py with best.pt, because last.pt is not the same as best.pt.

I believe that would make sense if the metrics I got from val.py were better. But when I use val.py with best.pt, the metrics are worse than those of last.pt during training.
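
A simple way to check this would be to validate both checkpoints with identical settings and compare the two tables; a sketch, with the paths shortened from the ones used earlier in the thread:

# same data, image size and thresholds; only the checkpoint differs
python val.py --data data_config.yaml --weights runs/train/exp52/weights/last.pt --img 2560 --task val
python val.py --data data_config.yaml --weights runs/train/exp52/weights/best.pt --img 2560 --task val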

@nqthai309

Maybe the conf-thres and iou-thres are different.
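
For context, val.py's defaults for these two flags are deliberately low, which is what the mAP computation expects; a sketch making them explicit (paths shortened):

# val.py defaults: --conf-thres 0.001 and --iou-thres 0.6 (NMS IoU)
# raising --conf-thres (e.g. to 0.25) will noticeably lower the reported recall and mAP
python val.py --data data_config.yaml --weights runs/train/exp52/weights/best.pt --img 2560 --conf-thres 0.001 --iou-thres 0.6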

@glenn-jocher
Member

glenn-jocher commented Dec 8, 2021

@KristofferK 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible to produce the problem
  • Complete – Provide all parts someone else needs to reproduce the problem
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance your code should also be:

  • Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
  • Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

@KristofferK
Author

It seems to be because train.py uses --single-cls for the metrics (despite single_cls being false in opt.yaml), while val.py does not use --single-cls by default.

As such, if I add --single-cls to my val.py arguments, it produces the same results as I had in train.py. But I did not use --single-cls for train.py.

But will this not result in suboptimally trained models, if these models are trained as single-class classifiers? Or does it only use single-cls when printing to the console and results.csv?

@glenn-jocher
Member

@KristofferK --single-cls is an optional argument that allows you to train any dataset in single-class mode, i.e. all 80 classes in COCO will be treated as one class.
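
In other words, the flag only changes how the labels are grouped; a sketch of the two ways to run validation (paths shortened):

# multi-class evaluation (default): per-class P, R and mAP are reported
python val.py --data data_config.yaml --weights best.pt --img 2560
# single-class evaluation: all labels are collapsed into one class before computing metrics
python val.py --data data_config.yaml --weights best.pt --img 2560 --single-cls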

@KristofferK
Author

@glenn-jocher But even when not specifying --single-cls during training (train.py), the printed metrics (in the terminal or results.csv) seem to be the metrics for a single-class classifier, i.e. the metrics it prints ignore whether the class is correctly predicted.

This must be the case, since when I run val.py with --single-cls I get the same metrics as I get from train.py without --single-cls.
But if I run val.py without --single-cls, I do not get the same metrics as train.py.

So the metrics printed during training are metrics for a single-class classifier, even though we are training a multi-class model?

@glenn-jocher
Member

@KristofferK I have no idea what you are asking. If you are not trying to force a dataset into single-class mode, there is no reason to use this argument.

Train and val operate correctly. If you have a reproducible bug, submit a bug report with code to reproduce it.
