
width-height: 'yolo' method vs. 'power' method #168

Closed
100330706 opened this issue Mar 26, 2019 · 20 comments
Labels
documentation (Improvements or additions to documentation) · question (Further information is requested) · Stale (Stale and scheduled for closing soon)

Comments

@100330706

100330706 commented Mar 26, 2019

Hi! We are running your YOLO implementation on a 5-class detection task. However, at some iteration of some epoch (it is not always the same), the loss suddenly starts racing toward infinity, giving nan values. The term that seems to be increasing exponentially is the wh loss (the wh tensor sometimes has negative values; I don't know if this is normal). By applying your power method wh = torch.sigmoid(p[..., 2:4]) # wh (power method) instead of wh = p[..., 2:4] # wh (yolo method), this problem stops and the algorithm no longer diverges. However, the wh loss flattens out at a higher value (around 1.07 to 1.08, instead of going down to 0) than the other losses, as shown below:

Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
38/59    425/1242   0.00705      1.07   0.00528  0.000134      1.09         4     0.254
38/59    426/1242    0.0071      1.07   0.00527  0.000133      1.09         4     0.258
38/59    427/1242   0.00711      1.08   0.00527  0.000133      1.09         4     0.254
38/59    428/1242   0.00712      1.08   0.00526  0.000133      1.09         3     0.251

Do you have any clue why this could be happening? What are the supposed advantages of using the power method?

Kind regards.

@100330706 100330706 added the bug Something isn't working label Mar 26, 2019
@glenn-jocher
Member

glenn-jocher commented Mar 26, 2019

The wh losses are the most unstable in the darknet implementation due to their unbounded nature, as you noticed. We created the 'power' method to stabilize the training in such situations. Typically if the wh losses diverge they do so in the first few epochs.

However, the switch from the darknet to the power method for the wh parameter needs to be made in multiple places in the code, not just in the loss function. This can make switching back and forth confusing, so we should probably parameterize a switch for these two methods. I'll add this to the TODO list for the next release.

IMPORTANT: If you use the default yolov3.weights you should use the default darknet wh implementation. The power method should only be used when training a new model, and then all inference done on that model must also use the power method.

This issue has a comparison plot of the two methods:
#12 (comment)

@glenn-jocher glenn-jocher changed the title from "nan values issue solved by the power method but still not optimally converging" to "width-height loss divergence: 'darknet' method vs. 'power' method" Mar 26, 2019
@100330706
Author

@glenn-jocher Appreciate your response. So, occasionally not converging is normal when using the darknet method, isn't it? Where else should the power method be applied? So far I've seen it commented out in the build_targets function and in the YOLO layer. Would it work by uncommenting just these and commenting out the darknet ones?

@glenn-jocher
Member

glenn-jocher commented Mar 28, 2019

@100330706 yes, these are the two places you need to make the switch, and in the ONNX export as well if you plan to use that feature. Note that the power method currently trends from 0 to 4 as the input trends from negative infinity to positive infinity. The yolo method has unbounded outputs, causing it to diverge on occasion, as you've noticed.

If you feel you need higher wh anchor multiples than 4 in your scenario you can use a different exponent (i.e. 2^3 instead of 2^2 to range from 0-8). The beauty of this design, and the reason we selected it, is that the output will always equal 1 when the input equals zero, regardless of the exponent used (i.e. (sigmoid(0) * 2)^x = 1 regardless of x). This is an important quality that centers the results given a random weight initialization.
[comparison plot: exp(x) 'yolo' method vs. (sigmoid(x) * 2)^e 'power' method]
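For concreteness, a minimal sketch (not repo code) of the bounds and the centering property:

import torch

x = torch.tensor([-10.0, 0.0, 10.0])
for e in (2, 3):
    # output is bounded in (0, 2 ** e), and equals 1 exactly at x = 0
    print(e, (torch.sigmoid(x) * 2) ** e)  # ~[0, 1, 4] for e=2; ~[0, 1, 8] for e=3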

@glenn-jocher glenn-jocher changed the title from "width-height loss divergence: 'darknet' method vs. 'power' method" to "width-height loss divergence: 'yolo' method vs. 'power' method" Mar 28, 2019
@glenn-jocher glenn-jocher added question Further information is requested and removed bug Something isn't working labels Mar 28, 2019
@glenn-jocher glenn-jocher changed the title from "width-height loss divergence: 'yolo' method vs. 'power' method" to "width-height: 'yolo' method vs. 'power' method" Apr 10, 2019
@glenn-jocher
Member

glenn-jocher commented Apr 10, 2019

TO SUMMARIZE: The 'yolo' width-height (wh) method is the official darknet wh computation method. The 'power' method is a more stable implementation we created to address instability issues arising when training on custom data.

  • If you are running inference from official weights (i.e. yolov3.weights, yolov3-spp.pt, etc.) DO NOT change the wh computation method; leave the 'yolo' method in place.
  • If you are training on custom data and your wh loss is diverging, you can try swapping 'yolo' wh for 'power' wh. This change needs to occur in 2 separate areas of the code. If in doubt, search your yolov3 path for wh power to find the instances:

yolov3/utils/utils.py

Lines 260 to 269 in f0b4f9f

# Compute losses
k = 1  # nT / bs
if len(b) > 0:
    pi = pi0[b, a, gj, gi]  # predictions closest to anchors
    tconf[b, a, gj, gi] = 1  # conf
    lxy += (k * 8) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i])  # xy loss
    lwh += (k * 4) * MSE(pi[..., 2:4], twh[i])  # wh yolo loss
    # lwh += (k * 4) * MSE(torch.sigmoid(pi[..., 2:4]), twh[i])  # wh power loss
    lcls += (k * 1) * CE(pi[..., 5:], tcls[i])  # class_conf loss

yolov3/models.py

Lines 158 to 165 in f0b4f9f

else:  # inference
    io = p.clone()  # inference output
    io[..., 0:2] = torch.sigmoid(io[..., 0:2]) + self.grid_xy  # xy
    io[..., 2:4] = torch.exp(io[..., 2:4]) * self.anchor_wh  # wh yolo method
    # io[..., 2:4] = ((torch.sigmoid(io[..., 2:4]) * 2) ** 3) * self.anchor_wh  # wh power method
    io[..., 4:] = torch.sigmoid(io[..., 4:])  # p_conf, p_cls
    # io[..., 5:] = F.softmax(io[..., 5:], dim=4)  # p_cls
    io[..., :4] *= self.stride
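For reference, a hedged sketch of the target encodings each loss variant implies (the wh_targets helper below is hypothetical, not a repo function): the 'yolo' loss regresses log-space anchor multiples, while the 'power' loss regresses sigmoid-space values that invert the ((sigmoid(x) * 2) ** 3) decode.

import torch

def wh_targets(gwh, anchor_wh, method='yolo'):
    # gwh, anchor_wh: ground-truth and anchor wh in grid units
    ratio = gwh / anchor_wh  # box size as a multiple of the anchor
    if method == 'yolo':
        return torch.log(ratio)  # pairs with MSE(pi[..., 2:4], twh)
    return ratio ** (1. / 3) / 2  # pairs with MSE(sigmoid(pi[..., 2:4]), twh)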

@sanazss

sanazss commented Jul 21, 2019

So all img values are changed to the default color you put in the letterbox function, color = 127.5, 127.5, 127.5, which rounds up to 128. Is that the logic behind it? Because there is nothing else that could change all values of img to 128 in the COCO dataset. What if one uses grayscale images, should these numbers change?

@glenn-jocher
Member

@sanazss this is grey padding.
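For context, a minimal letterbox sketch (assumes OpenCV; it mirrors the idea, not the repo's exact function): resize preserving aspect ratio, then pad the remainder with grey so the padded pixels land at ~128.

import cv2

def letterbox_sketch(img, new_shape=416, color=(128, 128, 128)):
    h, w = img.shape[:2]
    r = new_shape / max(h, w)  # scale so the longer side fits new_shape
    img = cv2.resize(img, (int(round(w * r)), int(round(h * r))))
    dh, dw = new_shape - img.shape[0], new_shape - img.shape[1]
    top, left = dh // 2, dw // 2  # split padding evenly on both sides
    return cv2.copyMakeBorder(img, top, dh - top, left, dw - left,
                              cv2.BORDER_CONSTANT, value=color)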

@bchugg

bchugg commented Nov 14, 2019

@glenn-jocher I believe in the newest implementation there are only two places where one needs to make this change (it doesn't seem like it needs to be done in build_targets). If this is right, it could be useful to update your summary to avoid confusion. If this is wrong, then I'd love to know where else to make the change. My loss currently diverges when using the yolo method, but P, R, mAP and F1 all remain zero when using the power method. Odd.

@glenn-jocher
Member

glenn-jocher commented Nov 14, 2019

@bchugg it's true that the darknet/'yolo' wh method is subject to divergence (as is clear from the plot above). The introduction of GIoU has mostly suppressed instances of this happening, though it still does happen on occasion, typically when the GIoU gain hyperparameter or the SGD LR is set too high.

You are correct also, the introduction of GIoU removed the w/h calculation from build_targets. I will update the 'TO SUMMARIZE' comment above!

In any case, to keep things simple for you, I would leave the wh method at the default and simply reduce your GIoU gain or SGD LR hyperparameters:

yolov3/train.py

Lines 23 to 30 in a96e010

# Hyperparameters (k-series, 57.7 mAP yolov3-spp-416) https://github.com/ultralytics/yolov3/issues/310
hyp = {'giou': 3.31,  # giou loss gain
       'cls': 42.4,  # cls loss gain
       'cls_pw': 1.0,  # cls BCELoss positive_weight
       'obj': 40.0,  # obj loss gain (*=img_size/320 * 1.1 if img_size > 320)
       'obj_pw': 1.0,  # obj BCELoss positive_weight
       'iou_t': 0.213,  # iou training threshold
       'lr0': 0.00261,  # initial learning rate (SGD=1E-3, Adam=9E-5)
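For example, a hedged sketch of dialing these back before training (values illustrative, not tuned):

hyp['giou'] *= 0.5  # halve the GIoU loss gain
hyp['lr0'] = 1E-3   # lower the initial SGD learning rate (from 0.00261)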

@glenn-jocher
Member

I've clamped the wh output to max=1E4 now to prevent wh divergence. This should resolve the issue completely.

pbox = torch.cat((pxy, torch.exp(ps[:, 2:4]).clamp(max=1E4) * anchor_vec[i]), 1) # predicted box
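A quick sketch of why the clamp matters: exp() of a large raw output overflows to inf in float32, which turns the loss into nan; the clamp caps the decoded wh at a finite value.

import torch

p = torch.tensor([100.0])
print(torch.exp(p))                 # tensor([inf]) -> nan losses downstream
print(torch.exp(p).clamp(max=1E4))  # tensor([10000.]) stays finite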

@ghost

ghost commented Jan 10, 2020

@glenn-jocher I was implementing YOLOv2 from scratch for face detection and encountered the same issue: the loss going nan purely because of the wh_loss term. So I came across this power method you have developed and started training with it. I observed that the exploding-gradients problem is gone now, but the model is not converging to an acceptable optimum; the loss stagnates around 15.xx after some 1000 epochs. So I have some questions which I hope you can help me with:

  1. While building the targets, do the w and h dimensions have to be log-transformed, or just left in the scale of the grid, by doing something like this:
     g_wh = g_wh / image_w

  2. The power method you have referred to, does it look something like this:
     wh_loss = clamp(exp(sigmoid(g_wh)), 1e4) * anchor_dimension
     If not, can you please correct the loss equation?

  3. During inference, the wh transformation looks something like this:
     p_wh = exp(wh) * anchor_dimension
     Is this way of inferencing correct?

Also, I am using MSE for the regression loss.

thanks.

@glenn-jocher
Member

@agarwalyogeesh the power method seen in #168 (comment) is in units of grid points. There are no log operations. The equations are in #168 (comment).

Note that the GIoU loss implementation seems to fix most of the unstable losses in the original exp wh method, and we now use this combination (GIoU loss with the original exp wh).
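For readers unfamiliar with GIoU, a hedged sketch for axis-aligned (x1, y1, x2, y2) boxes; the training loss is then 1 - GIoU:

import torch

def giou(box1, box2, eps=1e-9):
    # intersection of the two boxes
    iw = (torch.min(box1[2], box2[2]) - torch.max(box1[0], box2[0])).clamp(0)
    ih = (torch.min(box1[3], box2[3]) - torch.max(box1[1], box2[1])).clamp(0)
    inter = iw * ih
    union = ((box1[2] - box1[0]) * (box1[3] - box1[1]) +
             (box2[2] - box2[0]) * (box2[3] - box2[1]) - inter + eps)
    iou = inter / union
    # smallest enclosing box
    cw = torch.max(box1[2], box2[2]) - torch.min(box1[0], box2[0])
    ch = torch.max(box1[3], box2[3]) - torch.min(box1[1], box2[1])
    c_area = cw * ch + eps
    return iou - (c_area - union) / c_area  # GIoU = IoU - (C - U) / C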

@ghost

ghost commented Jan 11, 2020

@glenn-jocher, thanks for your reply.
I understand that using GIoU will eliminate the unstable-loss problem, but before implementing that I wanted to experiment with MSE for regression, and wanted to ask: what is k in the loss computation in this comment? #168 (comment)

Also, in the inference equation you mentioned above, ((sigmoid(x) * 2) ** 3), the range varies over [0-8], right? How is this range compatible with the range used in the loss function, given k = 1 in the loss computation for the power method?

The model's failure to converge is what worries me; maybe it's due to too little data, or maybe I need more training time.

thanks,

@glenn-jocher
Member

@agarwalyogeesh k is a hyperparameter; it's tunable. The output range can vary from 0 to any number; in our case we set it to 8.

@github-actions

github-actions bot commented Mar 8, 2020

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale (Stale and scheduled for closing soon) label Mar 8, 2020
@meet-minimalist

(quoting glenn-jocher's 'TO SUMMARIZE' comment above)
Thanks for the wonderful explanation and for the idea to solve the training instability.

I want to point out that the output range of the power wh method will be [0-8], and that will be multiplied by the anchor width/height.

But this forces the prediction to be greater than or equal to the anchor size, not smaller. Our prediction should handle boxes both smaller and larger than the anchor, so with the current power wh method it is forced to predict only boxes larger than (or equal to) the anchor.

So I suggest we use the tanh function, which has a range of [-1, +1]. We then multiply it by 2 to make the range [-2, +2]. After we square it / cube it, we may get the output range [-4, +4] / [-8, +8].

This may solve that issue and probably provide better results.

@glenn-jocher
Member

glenn-jocher commented Apr 15, 2020

@meet-minimalist yes, tanh is very similar to sigmoid, and it outputs from -1 to 1.

But we are outputting a multiple of the anchor, which currently ranges from 0 to inf. The exp(x) method has no upper limit on x, causing the instability, but we need an output floor of 0 in all cases.

The proposed change is (sigmoid(x) * 2) ** 3, which ranges from 0-8, and retains the same centerpoint f(0)=1 as exp(x).
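A quick numeric check of that point (sketch, not repo code): a tanh-based multiplier goes negative for negative inputs, which is invalid for a wh scale, while the sigmoid form keeps the required floor of 0.

import torch

x = torch.tensor([-2.0, 0.0, 2.0])
print((torch.tanh(x) * 2) ** 3)     # negative for x < 0 -> invalid wh multiple
print((torch.sigmoid(x) * 2) ** 3)  # always > 0, and exactly 1 at x = 0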

@meet-minimalist

(quoting glenn-jocher's reply above)

Yeah, you are correct. My bad; I forgot this is a multiplier, which needs to be positive in any case.
For a box smaller than the anchor, the prediction value will be in [0-1], and for a box larger than the anchor, it will be in [1-8].

Thanks again.

@glenn-jocher
Member

glenn-jocher commented Aug 29, 2021

YOLOv3 vs YOLOv5 wh method plotting code:

import numpy as np
import torch
from matplotlib import pyplot as plt


def plot_wh_methods():  # from utils.plots import *; plot_wh_methods()
    # Compares the two methods for width-height anchor multiplication
    # https://github.com/ultralytics/yolov3/issues/168
    x = np.arange(-4.0, 4.0, .1)
    ya = np.exp(x)
    yb = torch.sigmoid(torch.from_numpy(x)).numpy() * 2

    fig = plt.figure(figsize=(6, 3), tight_layout=True)
    plt.plot(x, ya, '.-', label='YOLOv3')
    plt.plot(x, yb ** 2, '.-', label='YOLOv5 ^2')
    plt.plot(x, yb ** 1.6, '.-', label='YOLOv5 ^1.6')
    plt.xlim(left=-4, right=4)
    plt.ylim(bottom=0, top=6)
    plt.xlabel('input')
    plt.ylabel('output')
    plt.grid()
    plt.legend()
    fig.savefig('comparison.png', dpi=200)

[Figure: exp(x) 'YOLOv3' curve vs. (sigmoid(x) * 2)^2 and ^1.6 'YOLOv5' curves]

@glenn-jocher glenn-jocher added the documentation Improvements or additions to documentation label Aug 29, 2021
@jaskiratsingh2000

(quoting glenn-jocher's 'TO SUMMARIZE' comment above)
@glenn-jocher These changes have to be made before training, and once the swap is done we then train on our custom data, right?

@glenn-jocher
Member

@jaskiratsingh2000 this issue is simply explaining the existing updates to our regression equations; no modifications need to be made in the code, as the updates are already implemented.
