
Train YOLOv3-SPP from scratch to 62.6 mAP@0.5 #310

Closed
Aurora33 opened this issue May 31, 2019 · 151 comments

@Aurora33

Hi,
Thanks for sharing your work!
I would like to know what configuration you used to train yolov3.cfg to reach 55% mAP.
We tried 100 epochs but got an mAP of about 35% that no longer improves much, and the test loss has started to diverge slightly.
Why do you use such a high gain for the confidence loss?
Thanks in advance for your reply.
[results plot attached]

@glenn-jocher
Member

glenn-jocher commented Jun 1, 2019

@Aurora33 the mAPs reported in https://github.com/ultralytics/yolov3#map are using the original darknet weights files. We are still trying to determine the correct loss function and optimal hyperparameters for training in PyTorch. There are a few issues open on this, such as #205 and #12. A few things of note:

  • The plotted mAPs are at 0.1 conf_thres (for speed during training). If you run test.py directly it will run mAP at 0.001 conf_thres, which will produce a higher mAP.
  • Your LR scheduler may or may not have taken effect here, depending on how you set the --epochs argument in the argparser.
  • Darknet training uses multi_scale by default, with scaling from 50% to 150% of your default size.
  • Darknet training also involves several steps I believe, including training on other datasets and altering layers. You can read about this more in the YOLOv2 and YOLOv3 papers: https://pjreddie.com/publications/
  • This implementation lacks the 0.7 ignore threshold in the original darknet, which is on our TODO list but not yet implemented (see the sketch after this list).
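
As a rough illustration of what that ignore threshold does (this is not darknet's or this repo's exact code, and the tensor names are made up), the objectness loss is simply skipped for non-best anchors whose IoU with a label is already high:

def objectness_masks(best_iou, best_anchor_mask, ignore_thres=0.7):
    # best_iou:         [A, G, G] tensor, IoU of each anchor's prediction with its best-matching label
    # best_anchor_mask: [A, G, G] bool tensor, True where the anchor is the assigned (best) match for a label
    obj_mask = best_anchor_mask                              # positives: only best-matching anchors
    noobj_mask = (best_iou < ignore_thres) & ~obj_mask       # negatives: clearly-background anchors
    # non-best anchors with IoU >= ignore_thres land in neither mask,
    # so they contribute nothing to the confidence (objectness) loss
    return obj_mask, noobj_mask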

@majuncai

I also get only 38% mAP after 170 epochs on the COCO dataset, and the mAP no longer improves much.

@glenn-jocher
Member

@majuncai can you post your results? Did you use --multi-scale? We have made quite a few updates recently, in particular to multi-scale, which is required to achieve the best results, as well as training to the last epoch specified so that the LR scheduler takes full effect.

@majuncai

@glenn-jocher
[training results screenshot attached]
I didn't change any parameters, and I didn't use multi-scale.

@majuncai

@glenn-jocher
[screenshot attached]

@glenn-jocher
Member

glenn-jocher commented Jun 13, 2019

@majuncai I see. The main thing I noticed is that your LR scheduler has not taken effect, since it only kicks in at 80% and 90% of the epoch count. The mAP typically increases significantly after this. Multi-scale training also has a large effect. Lastly, we reinterpreted the darknet training settings, so we believe you only need 68 epochs for full training.

All of these changes have been applied in the last few days. I recommend you git pull and train from scratch to 68 epochs. Then you can plot your results and upload them again here using the following command:
from utils.utils import *; plot_results()

@XinjieInformatik

Hi Glenn, thanks for your great work. For the classification loss, the YOLOv3 paper says they don't use softmax, but nn.CrossEntropyLoss() actually contains a softmax. The paper also uses an ignore threshold. Could these affect the overall mAP?

@glenn-jocher
Member

@XinjieInformatik yes, these could affect the mAP greatly, in particular the ignore threshold. For some reason we've gotten lower mAP with BCELoss for classification than with CELoss. I don't know why. This may be a PyTorch phenomenon, as simpler tasks like MNIST also train better with CELoss than BCELoss in PyTorch.
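
For reference, the two classification-loss variants being compared look roughly like this (a minimal sketch with made-up shapes, not the repo's actual loss code):

import torch
import torch.nn as nn

pred_cls = torch.randn(8, 80)               # [matched anchors, classes] raw logits
target_idx = torch.randint(0, 80, (8,))     # integer class id per matched anchor

# softmax-based, single-label (CELoss, what we currently use)
ce = nn.CrossEntropyLoss()(pred_cls, target_idx)

# sigmoid-based, multi-label (BCE, as described in the YOLOv3 paper)
target_onehot = torch.zeros_like(pred_cls).scatter_(1, target_idx.unsqueeze(1), 1.0)
bce = nn.BCEWithLogitsLoss()(pred_cls, target_onehot)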

@fereenwong

@glenn-jocher Hi, I am curious how the provided yolov3.pt is obtained. Is it converted from yolov3.weights or trained with the code you shared?

@glenn-jocher
Member

@fereenwong yolov3.pt is exported from yolov3.weights.

@ktian08

ktian08 commented Jul 24, 2019

@glenn-jocher Just finished an experiment training on the full COCO dataset from scratch, using the default hyperparameter values. My model was YOLOv3-320, and I trained to 200 epochs with multi-scale training on and rectangular training off. After running test.py, I managed to get 47.4 mAP, which unfortunately is not the 51.5 corresponding to pjreddie's experiment.

I can try training again but this time to 273 epochs, although it seems that each stage before the learning rate decreased per the scheduler had already plateaued, so I don't think it would benefit much. Is there a comprehensive TODO list that you think will improve mAP? I notice you mentioned the 0.7 ignore threshold. What do you mean by this? I searched through the darknet repo and didn't get any relevant 0.7 hits.

@glenn-jocher
Member

glenn-jocher commented Jul 24, 2019

@ktian08 ah, thanks for the update! This is actually a bit better than I was expecting at this point, since we are still tuning hyperparameters. The last time we trained fully, we trained 320 also to 100 epochs and ended up at 0.46 mAP below (no multi-scale, no rect training), using older hyps from about a month ago. Remember the plots are at conf_thres 0.1; test.py runs natively at conf_thres 0.001, which adds a few percent mAP compared to the training plots. Can you post a plot of your training results using from utils import utils; utils.plot_results() and a copy of your hyp dictionary from train.py?

0.70 is a threshold darknet uses to not punish anchors which aren't the best but still have an iou > 0.7. We use a slightly different method in this repo, which is the hyp['iou_t'] parameter.

Yes I also agree that training seems to be plateauing too quickly. This could be because our hyp tuning is based on epoch 0 results only, so it may be favoring aspects that aggressively increase mAP early on, which may not be best for training later epochs. Our hyp evolution code is:

python3 train.py --data data/coco.data --img-size 320 --epochs 1 --batch-size 64 --accumulate 1 --evolve

results_320

@ktian08

ktian08 commented Jul 24, 2019

Hey,
[screenshot of training results and hyp dictionary attached]

Ah, I see. My repo currently does have the reject boolean set to True, so it is thresholding by iou_t, just by a different value. Are you saying darknet uses 0.7 for this value?

I have not begun evolving hyperparameters yet, as the ones I've used were the defaults for yolov3-spp, I believe. However, I've modified my training script to evolve every opt.epochs epochs, because that's how I interpreted the script, rather than evolving based on the first epoch only. To accomplish this, I've also changed train to output the best_result (based on 0.5 * mAP + 0.5 * f1) rather than the result from the last epoch, so print_mutations has the correct value. I'll try evolving the hyperparameters based on a smaller number of epochs > 1 and let you know if I get better results.

@glenn-jocher
Member

glenn-jocher commented Jul 25, 2019

@ktian08 ah excellent. Hmm, your results are very different from the ones I posted. The more recent results should reach almost 0.15 mAP starting at epoch 0, whereas yours start around 0.01 at epoch 0 and increase slowly from there.

Clearly my plots show faster short-term results, but I don't know if they are plateauing lower or higher than yours; it's hard to tell.

No, the 0.7 value corresponds to a different type of IoU thresholding in darknet. In this repo, if iou < hyp['iou_t'] then no match is made. This prevents large anchors from attempting to match with small targets and vice versa. This parameter typically seems to evolve to 0.20-0.35. In your version it's at 0.3689, whereas now we have 0.194, though the latest unpublished hyperparameters show a best value of 0.292.
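
As a rough sketch of that rejection rule (illustrative only, using a width-height IoU between anchors and targets on the same grid scale; not the exact repo code):

import torch

def match_anchors(anchor_wh, target_wh, iou_t=0.20):
    # anchor_wh: [A, 2], target_wh: [T, 2] widths/heights in grid units
    inter = torch.min(anchor_wh[:, None], target_wh[None]).prod(2)         # [A, T] intersection area
    union = anchor_wh.prod(1)[:, None] + target_wh.prod(1)[None] - inter   # [A, T] union area
    iou = inter / union
    return iou > iou_t  # [A, T] mask; pairs below iou_t are rejected, so large anchors never match tiny targets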

Unfortunately we are resource constrained, so we can't evolve as much as we'd like. Ideally you'd probably want to run the evolution off of the results of, say, the first 10 or 20 epochs, but we are running it off of epoch 0 results, which allows us to evolve many more generations, even though it's unclear whether epoch 0 success correlates 100% with epoch 273 success.

Also beware that we have added the augmentation parameters to the hyp dictionary, so you may want to git pull to get the latest. You can also evolve your own hyperparameters using the same code I posted before, or if you want you could contribute to our hyp study as well by evolving to a cloud bucket we have.

@ktian08

ktian08 commented Jul 25, 2019

Right, I think you start at 0.15 mAP because you load in the darknet weights as your default setting. I modified the code so I'm truly training COCO from scratch.

I'll pull the new hyperparameters and try evolving within 10-20 epochs then. Thanks!

@glenn-jocher
Member

@ktian08 ah yes, this makes sense then; we are comparing apples to oranges.

Regarding fitness, I set it as the average of mAP and F1, because I saw that when I set it as only mAP, the evolution would favor high R and low P to reach the highest mAP, so I added the F1 in an attempt to balance it.

yolov3/train.py

Lines 342 to 343 in df4f25e

def fitness(x):  # returns fitness of hyp evolution vectors
    return 0.5 * x[:, 2] + 0.5 * x[:, 3]  # fitness = 0.5 * mAP + 0.5 * F1

If you are doing lots of training BTW, you should install Nvidia Apex for mixed precision if you haven't already. This repo will automatically use it if it detects it.

@glenn-jocher
Member

@ktian08 I've added a new issue #392 which illustrates our hyperparameter evolution efforts in greater detail.

As I mentioned, with unlimited resources you would ideally evolve the full training results:

python3 train.py --data data/coco.data --img-size 320 --epochs 273 --batch-size 64 --accumulate 1 --evolve

But since we are resource constrained we evolve epoch 0 results instead, under the assumption that what's good for epoch 0 is good for full training. This may or may not be true, we simply are not sure at this point.

python3 train.py --data data/coco.data --img-size 320 --epochs 1 --batch-size 64 --accumulate 1 --evolve

@ktian08

ktian08 commented Jul 25, 2019

OK, I'll try installing Apex and evolving to as many epochs as I can. Earlier, I made a mistake calculating the mAP for my experiment, as I didn't pass in the --img-size parameter to test.py and thus my model tested on size 416 images. My newly calculated mAP is 47.4.

@glenn-jocher
Member

glenn-jocher commented Jul 26, 2019

@ktian08 ah I see. I forgot to mention that you should use the --save-json flag with test.py, as the official COCO mAP is usually about 1% higher than what the repo mAP code reports. You could try best.pt also instead of last.pt:

python3 test.py --weights weights/best.pt --img-size 320 --save-json

@ktian08

ktian08 commented Jul 26, 2019

Yep, already using --save-json and best.pt!

@glenn-jocher
Member

glenn-jocher commented Jul 30, 2019

@ktian08 I updated the code a bit to add an --img-weights option to train.py. When this is set, the dataloader selects images randomly weighted by their value, which is defined by the types of objects they contain and how well the mAP is evolving on those exact objects. If mAP is low on hair dryers, for example, and there are few hair dryers in the dataset, then many more images of hair dryers will be selected than, say, images of people.

This seems to show better mAP, at least during the first few epochs, both when training from darknet53 as well as when training with no backbone (0.020 to 0.025 mAP first epoch at 416 without backbone). I don't know what effect it will have long term however. I am currently training a 416 model to 273 epochs using all the default settings with the --img-weights flag. I just started this, so I should have results out in about a week, and then I'll share here.
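
The idea behind the weighting, as a minimal sketch (hypothetical names; it assumes per-class mAPs and per-image class counts are already available, which is not exactly how train.py computes them):

import numpy as np

def sample_image_indices(class_maps, image_class_counts, n):
    # class_maps:         [C] current mAP per class (low mAP -> higher weight)
    # image_class_counts: [N, C] label counts per image
    class_weights = (1.0 - class_maps) / (image_class_counts.sum(0) + 1e-6)  # rare + poorly-learned classes weigh more
    image_weights = (image_class_counts * class_weights).sum(1)              # [N] per-image weight
    p = image_weights / image_weights.sum()
    return np.random.choice(len(p), size=n, replace=True, p=p)               # indices the dataloader draws each epoch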

@ktian08

ktian08 commented Aug 2, 2019

@glenn-jocher Does training seem to improve using --img-weights based on your experiments so far? I am currently retraining on new hyperparameters I got from evolving, but despite the promising mAPs obtained during evolution, I see that the mAP for my new experiment is pretty much the same as my control experiment ~60 epochs in.

@glenn-jocher
Member

@ktian08 I might be seeing a similar effect. It's possible that the first few epochs are much more sensitive to the hyperparameters, and small changes in them eventually converge to the same result after 50-100 epochs.

I'm not sure what conclusion to draw from this, other than that hyperparameter searches based on quick results (epoch 0, epoch 1 results, etc.) may not be as useful as they appear. Oddly enough, I also saw almost no change in mAP at the baseline img-size when using --multi-scale, even after 30-40 epochs.

@ktian08

ktian08 commented Aug 2, 2019

@glenn-jocher Hmm... I trained to 20 epochs while tuning hyperparameters, but even comparing at the 20th epoch I don't see the 7 mAP increase that I should have seen.

Maybe --img-weights reduces AP for the classes that are doing well during training, so that the mAP in the end is the same regardless.

I noticed that pjreddie describes multi-scale training quite differently in the YOLO9000 paper from what is implemented here (scaling from /1.5 to *1.5). He says that every 10 batches he chose a new dimension from 320 to 608, as long as it was divisible by 32, which allows a much larger range for YOLOv3-320 and might help. Does his YOLOv3 repo also implement it like this, or is it done your way?

@glenn-jocher
Member

@ktian08 here is the current comparison using all default settings (416, no multi-scale, etc.). I'll update daily (about 40 epochs/day). Both are training to the full 273 epochs.

  • Orange is baseline python3 train.py
  • Blue is experiment python3 train.py --img-weights

[results plot attached]

@glenn-jocher
Member

@ktian08 this should implement it as closely as possible to darknet. Every 10 batches (i.e. every 640 images) the img-size is randomly rescaled by a factor between 1/1.5 and 1.5, rounded to the nearest 32-multiple, so from 288-640 for img-size 416 or from 224-480 for img-size 320.

I've run AlexeyAB/darknet to verify, and it does the same.
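
A small sketch of that schedule (illustrative only, with a stand-in for the training loop; the actual batch resize is left as a comment):

import random

img_size = 416   # base training size
grid = 32        # network stride
lo, hi = round(img_size / 1.5 / grid), round(img_size * 1.5 / grid)  # 9..20, i.e. 288..640 for 416

for batch_i in range(100):       # stand-in for the dataloader loop
    if batch_i % 10 == 0:        # pick a new size every 10 batches
        img_size = random.randint(lo, hi) * grid
    # imgs = torch.nn.functional.interpolate(imgs, size=img_size, mode='bilinear')  # resize the batch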

@HwangQuincy

@glenn-jocher Yes, we use Apex for mixed-precision training, but I have never compared against single-GPU training for the mAP result because of the long training time, so I am not sure whether DistributedSampler will influence the mAP results.

@HwangQuincy

@glenn-jocher Meanwhile, I don't like using k-means to search for optimized anchors; as mentioned in the YOLOv4 paper, optimized anchors only improve mAP by about 0.9%, and they depend heavily on the dataset.

@glenn-jocher
Member

glenn-jocher commented May 23, 2020

@HwangQuincy our function uses kmeans for an initial guess (but you can also supply your own initial guess), then the anchors are optimized with a genetic algorithm for 1000 generations. But if you don't want to use it, then just supply whatever you prefer in the cfg.

There's typically no such thing as 'anchor free', I'm pretty sure that's just marketing buzz. When you see that it typically means the anchors are correlated to the output layer stride rather than the training labels.

Ok got it on the training. Can you submit a PR for mp spawn then and we will review?

@HamsterHuey

There's typically no such thing as 'anchor free', I'm pretty sure that's just marketing buzz. When you see that it typically means the anchors are correlated to the output layer stride rather than the training labels.

I'm pretty sure that CenterNet (Objects as Points) is an anchor-free approach. Almost any keypoint-style detector would work similarly, and they do not rely on anchor priors.

@glenn-jocher
Member

@HamsterHuey ah, that's a good point. I haven't played around with CenterNet, so I'm not too familiar with their method. The main thing about 'anchors' is that they are simply a normalization of the outputs. All neural networks perform best when the inputs and outputs are normalized (i.e. mean of 0 and standard deviation of 1), not only CNNs but even the simplest regression networks, so this practice is really widespread across all of ML.

@HamsterHuey

@glenn-jocher - While it is true that anchors help regress out bboxes, they are also (imo) a bit of a hack to work around another aspect of YOLO and similar anchor-based approaches: their output feature map is substantially lower-dimensional (spatially) than the input image. So now you have a "grid-cell", which in YOLOv2 was 32x smaller than the input image dimensions, so a 416 x 416 image resulted in a 13 x 13 output set of "grid-cells". The problem with grid-cells is that each grid-cell needs to be able to make predictions for all objects whose center lies within its cell, and oftentimes there are several objects within a grid-cell due to the large reduction factor / network stride associated with each cell. So you have anchor boxes to help regress out box dimensions, and a pretty complicated set of logic in the loss function that determines which anchor gets assigned a ground-truth annotation for a given grid-cell, how to handle loss calcs for the other non-assigned anchors, etc.

Keypoint-style networks like CenterNet discard a lot of this because their output feature map is much larger (only 4x smaller than the input image), so they can directly have each "grid-cell" be responsible for only a single class detection. Since each cell is responsible for a single detection, there is no need for anchor boxes or any complex assignment of ground-truth annotations to the correct anchor for loss calculations. It simplifies the loss function drastically, and most surprisingly, the bbox width and height can be regressed out directly via an L1 loss.

Like @HwangQuincy, I've never been a big fan of anchor boxes, as they have always seemed more of a band-aid to fix a limitation of the underlying detection approach: you now have a coupling between your trained network weights and the dataset you trained on. The implications for transfer learning are also not often noted: if you need to tweak the number of anchors for your final dataset of interest, you are still stuck with the COCO anchors if you wish to start from the pretrained COCO weights. All that being said, I don't think customized anchors make a huge difference one way or the other. It has always struck me as a slightly hacky way to overfit to a specific dataset to eke out max performance, but that's just my opinion 😃

Btw, nice job keeping up with maintaining this repo and bringing some sanity to the world of Pytorch forks of Yolo 😄

@glenn-jocher
Member

@HamsterHuey yes, this is a pretty good summary. Actually, the original YOLO was anchor-free, regressing the width and height directly, as you say the keypoint detectors do. Do the keypoint detectors do the regression in a normalized space, or in actual pixel coordinates?

The strides in yolov3 are 32, 16, and 8, though you could add additional layers for, say, stride 64 or stride 4. It would be nice to unify the outputs somehow into a more uniform strategy as you say, i.e. simply use the stride 8 output and a single anchor per grid cell for example, but there are tradeoffs involved in these other approaches, which is probably why we are where we are today with the anchors.

I looked at the CenterNet repo BTW, and it seems to produce lower mAP than this repo at a much slower inference time, so it's possible you are seeing some of those tradeoffs in practice there. I'm open to new ideas of course though. Our mission is to produce the most robust and usable training and detection repos across all possible custom datasets, so anchor robustness is part of that.

BTW, we have been developing a new repo these last couple of months that should be released publicly in the next few weeks. It aims to improve training robustness and user friendliness across the training process, and of course improve training results as well. This new repo includes an anchor analysis before training to report the best possible recall (BPR), etc., given the supplied settings, a step which is missing now.

@HamsterHuey

Nice to hear about your new repo. Out of curiosity, do you have documentation somewhere of the differences between regular YOLOv3-SPP and the Ultralytics approach that helped achieve the higher results listed in the main README? I'd love to get a sense of the main learnings you've had on that front.

I think comparing detectors is always hard, as the YOLOv4 paper also shows, because there are so many pieces to maximizing performance on a given dataset. As the jump from YOLOv3 -> YOLOv3-SPP -> YOLOv3-SPP-Ultralytics shows, there are many tweaks one can make to eke out more performance. I don't necessarily think CenterNet is better, though it also has not been optimized nearly as much as the work that has gone on both here with your tireless efforts and @AlexeyAB's efforts on his darknet fork.

To your question regarding the loss function in CenterNet, the width and height are directly regressed out in grid-cell space (so a 4x reduced coordinate space compared to the input image coordinate space). Having spent a lot of time with a custom YOLOv2 implementation and now with CenterNet, I will say that some of the things I like about the latter are that the loss function is astonishingly simple and easy to understand and compute, which results in fewer bottlenecks when training, and also that it does pretty well at inference without the need for NMS, which can be nice in compute-limited situations. But I think there are always tradeoffs in any approach, so I'm always curious to see how different approaches do and to learn of new findings that improve performance.

@glenn-jocher
Member

@HamsterHuey ah that's very interesting, I didn't realize you could skip the NMS there.

Yes, it is true that the loss function is very complicated here, in particular with regard to constructing the targets appropriately and matching anchors to labels, as you mentioned.

And yes you are also correct that the implementation matters a lot, as even identical architectures can return greatly different results.

@glenn-jocher
Member

@HamsterHuey and to answer your question, we unfortunately do not have a consolidated changelist. With limited resources, documentation takes a backseat to the development work. Hopefully at some point later this year we can write a paper to document the new repo.

@HwangQuincy

@glenn-jocher I have found the reason for the mAP difference: I had set accumulate = num_gpus * original accumulate when setting batch_size = original batch_size / num_gpus (which is correct when using DistributedSampler, as in the PyTorch examples).

@HwangQuincy

HwangQuincy commented May 24, 2020

@HwangQuincy our function uses kmeans for an initial guess (but you can also supply your own initial guess), then the anchors are optimized with a genetic algorithm for 1000 generations. But if you don't want to use it, then just supply whatever you prefer in the cfg.

There's typically no such thing as 'anchor free', I'm pretty sure that's just marketing buzz. When you see that it typically means the anchors are correlated to the output layer stride rather than the training labels.

Ok got it on the training. Can you submit a PR for mp spawn then and we will review?

Sorry, I have modified your original code; the modifications include the loss computation, target (positive sample) generation, and the network model definition (without .cfg) for faster introduction of new models. So my code with mp.spawn can't be submitted immediately. When I have some idle time, I will submit a PR with mp.spawn and DistributedSampler.

@HwangQuincy

@HamsterHuey Anchor-free methods include keypoint-based and center-based approaches. As is well known, YOLOv1 was the first center-based anchor-free method. I prefer the center-based anchor-free methods, and I guess there is no performance gap between keypoint-based and center-based anchor-free methods provided the annotations are in the form of bounding boxes. As stated in ATSS (https://arxiv.org/pdf/1912.02424.pdf), there is no significant performance gap between center-based anchor-free and anchor-based methods.

@selous123

then the anchors are optimized with a genetic algorithm for 1000 generations.

Hi, any pointers to where I can find the code for the genetic algorithm that optimizes the anchors?

@glenn-jocher
Member

@selous123

yolov3/utils/utils.py

Lines 655 to 660 in d6d6fb5

def kmean_anchors(path='./data/coco64.txt', n=9, img_size=(640, 640), thr=0.20, gen=1000):
    # Creates kmeans anchors for use in *.cfg files: from utils.utils import *; _ = kmean_anchors()
    # n: number of anchors
    # img_size: (min, max) image size used for multi-scale training (can be same values)
    # thr: IoU threshold hyperparameter used for training (0.0 - 1.0)
    # gen: generations to evolve anchors using genetic algorithm
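
The genetic part that runs after the k-means seed is essentially a mutate-and-keep-if-better loop; a rough sketch (illustrative only, with a stand-in fitness function rather than the repo's IoU-based one):

import numpy as np

def evolve_anchors(anchors, fitness_fn, gen=1000, sigma=0.1):
    # anchors: [n, 2] initial (e.g. k-means) anchor widths/heights
    best, best_fit = anchors.copy(), fitness_fn(anchors)
    for _ in range(gen):
        candidate = best * np.random.normal(1.0, sigma, best.shape).clip(0.3, 3.0)  # random mutation
        fit = fitness_fn(candidate)
        if fit > best_fit:                        # keep the mutation only if fitness improves
            best, best_fit = candidate, fit
    return best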

@HwangQuincy maybe I'll try to implement mp.spawn on my own, as the improvement does seem worth the effort. Can you point me to the best example you know of to get started?

@HwangQuincy

@glenn-jocher Sorry to reply so late; the best example I know of is https://github.com/pytorch/examples/blob/master/imagenet/main.py, where mp.spawn is used for distributed training.
BTW, when using mp.spawn with accumulate, accumulate should be kept unchanged.
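
A bare-bones sketch of the mp.spawn pattern from that example, trimmed to the essentials (the tiny model and the address/port are placeholders, not the repo's training code):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '9999')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)  # one process per GPU
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(10, 10).cuda(rank), device_ids=[rank])  # stand-in for the detector
    # build the DataLoader here with DistributedSampler(dataset, num_replicas=world_size, rank=rank),
    # dividing batch_size by world_size and keeping accumulate unchanged, as noted above
    dist.destroy_process_group()

if __name__ == '__main__':
    ngpus = torch.cuda.device_count()
    mp.spawn(worker, args=(ngpus,), nprocs=ngpus)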

@glenn-jocher
Member

Ok got it, thanks! I’ll try it out this week.

@feixiangdekaka

Ok got it, thanks! I’ll try it out this week.
'degrees': 1.98 * 0, # image rotation (+/- deg)
'translate': 0.05 * 0, # image translation (+/- fraction)
'scale': 0.05 * 0, # image scale (+/- gain)
'shear': 0.641 * 0} # image shear (+/- deg)
Do these parameters have no effect during training? When training on my own dataset, should I use --evolve, or how should I change the parameters in hyp?

@glenn-jocher
Member

@feixiangdekaka yes, your observation is correct, a number multiplied by zero will be zero.

@TAOSHss

TAOSHss commented Jun 1, 2020

@glenn-jocher I found that you set the parameter world_size=1 during the initialization of distributed training. What is the special significance of this?
if device.type != 'cpu' and torch.cuda.device_count() > 1 and torch.distributed.is_available():
    dist.init_process_group(backend='nccl',  # distributed backend
                            init_method='tcp://127.0.0.1:9999',  # distributed training init method
                            world_size=1,  # number of nodes for distributed training
                            rank=0)  # distributed training node rank

Why not use the default parameters so that all available processes can be used automatically?

@glenn-jocher
Member

@TAOSHss I believe these are the PyTorch defaults from tutorial code that was used for multi-GPU training. BTW, we've recently received comments that mp.spawn performs better in multi-GPU contexts.

@github-actions

github-actions bot commented Jul 2, 2020

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 2, 2020
@github-actions github-actions bot closed this as completed Jul 7, 2020
@glenn-jocher
Member

@glenn-jocher Speed is about 12-13 min/epoch using mp.spawn and 17-18 min/epoch without mp.spawn when training with 2 RTX 2080 Ti GPUs, without multi-scale and with img size = 384. This improvement should come from avoiding the well-known PIL problem mentioned in the PyTorch docs.

@HwangQuincy is there a chance you might be able to help us with our multi-GPU strategy for https://github.com/ultralytics/yolov5? A couple users are working on a DDP PR (ultralytics/yolov5#401) with mixed results. Since you got an mp.spawn implementation working here so well I thought you might be the best expert to consult on this. Our current multi-gpu strategy with YOLOv5 is little changed from what we have here, it is DDP with world size 1. It's producing acceptable results, but I don't want to leave any improvements on the table if we can incorporate them.
