
width-height: 'yolo' method vs. 'power' method #168

Closed
100330706 opened this issue Mar 26, 2019 · 20 comments
Labels
documentation (Improvements or additions to documentation) · question (Further information is requested) · Stale (Stale and scheduled for closing soon)

Comments

@100330706

100330706 commented Mar 26, 2019

Hi! We are running your YOLO implementation on a 5-class detection task. However, at some iteration of some epoch (it is not always the same), the loss suddenly starts racing toward infinity, giving nan values. The term that seems to be increasing exponentially is the wh loss (the wh tensor sometimes has negative values; I don't know if this is normal). By applying your power method wh = torch.sigmoid(p[..., 2:4]) # wh (power method) instead of wh = p[..., 2:4] # wh (yolo method), this problem stops and the algorithm no longer diverges. However, the wh loss flattens out at a higher value (around 1.07 to 1.08, instead of going down to 0) than the other losses, as shown below:

Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
38/59    425/1242   0.00705      1.07   0.00528  0.000134      1.09         4     0.254
38/59    426/1242    0.0071      1.07   0.00527  0.000133      1.09         4     0.258
38/59    427/1242   0.00711      1.08   0.00527  0.000133      1.09         4     0.254
38/59    428/1242   0.00712      1.08   0.00526  0.000133      1.09         3     0.251

Do you have any clue why this could be happening? What are the supposed advantages of using the power method?

Kind regards.

@100330706 100330706 added the bug Something isn't working label Mar 26, 2019
@glenn-jocher
Member

glenn-jocher commented Mar 26, 2019

The wh losses are the most unstable in the darknet implementation due to their unbounded nature, as you noticed. We created the 'power' method to stabilize the training in such situations. Typically if the wh losses diverge they do so in the first few epochs.

However, the switch from the darknet to the power method for the wh parameter needs to be made in multiple places in the code, not just in the loss function. This can make switching back and forth confusing, so we should probably parameterize a switch for these two methods. I'll add this to the TODO list for the next release.

IMPORTANT: If you use the default yolov3.weights you should use the default darknet wh implementation. The power method should only be used when training a new model, and then all inference done on that model must also use the power method.

This issue has a comparison plot of the two methods:
#12 (comment)

@glenn-jocher glenn-jocher changed the title from "nan values issue solved by the power method but still not optimally converging" to "width-height loss divergence: 'darknet' method vs. 'power' method" Mar 26, 2019
@100330706
Author

@glenn-jocher Appreciate your response. So, occasionally not converging is normal when using the darknet method, isn't it? Where else should the power method be applied? So far I've seen it commented out in the build_targets function and in the YOLO layer. Would it work by uncommenting just these and commenting out the darknet ones?

@glenn-jocher
Member

glenn-jocher commented Mar 28, 2019

@100330706 yes, these are the two places you need to make the switch, and in the ONNX export as well if you plan to use that feature. Note that the power method currently trends from 0 to 4 as the input trends from negative infinity to positive infinity. The yolo method has unbounded outputs, causing it to diverge on occasion, as you've noticed.

If you feel you need higher wh anchor multiples than 4 in your scenario you can use a different exponent (i.e. 2^3 instead of 2^2 to range from 0-8). The beauty of this design, and the reason we selected it, is that the output will always equal 1 when the input equals zero, regardless of the exponent used (i.e. (sigmoid(0) * 2)^x = 1 regardless of x). This is an important quality that centers the results given a random weight initialization.
[comparison plot: exp(x) 'yolo' method vs. (sigmoid(x) * 2)^e 'power' method]
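For concreteness, a minimal sketch (not repo code) of the bounds and the centering property:

import torch

x = torch.tensor([-10.0, 0.0, 10.0])
for e in (2, 3):
    # output is bounded in (0, 2 ** e), and equals 1 exactly at x = 0
    print(e, (torch.sigmoid(x) * 2) ** e)  # ~[0, 1, 4] for e=2; ~[0, 1, 8] for e=3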

@glenn-jocher glenn-jocher changed the title from "width-height loss divergence: 'darknet' method vs. 'power' method" to "width-height loss divergence: 'yolo' method vs. 'power' method" Mar 28, 2019
@glenn-jocher glenn-jocher added question Further information is requested and removed bug Something isn't working labels Mar 28, 2019
@glenn-jocher glenn-jocher changed the title from "width-height loss divergence: 'yolo' method vs. 'power' method" to "width-height: 'yolo' method vs. 'power' method" Apr 10, 2019
@glenn-jocher
Member

glenn-jocher commented Apr 10, 2019

TO SUMMARIZE: The 'yolo' width-height (wh) method is the official darknet wh computation method. The 'power' method is a more stable implementation we created to address instability issues arising when training on custom data.

  • If you are running inference from official weights (i.e. yolov3.weights, yolov3-spp.pt, etc.) DO NOT change the wh computation method; leave the 'yolo' method in place.
  • If you are training on custom data and your wh loss is diverging, you can try swapping 'yolo' wh for 'power' wh. This change needs to occur in 2 separate areas of the code. If in doubt, search your yolov3 path for wh power to find the instances:

yolov3/utils/utils.py

Lines 260 to 269 in f0b4f9f

# Compute losses
k = 1  # nT / bs
if len(b) > 0:
    pi = pi0[b, a, gj, gi]  # predictions closest to anchors
    tconf[b, a, gj, gi] = 1  # conf
    lxy += (k * 8) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i])  # xy loss
    lwh += (k * 4) * MSE(pi[..., 2:4], twh[i])  # wh yolo loss
    # lwh += (k * 4) * MSE(torch.sigmoid(pi[..., 2:4]), twh[i])  # wh power loss
    lcls += (k * 1) * CE(pi[..., 5:], tcls[i])  # class_conf loss

yolov3/models.py

Lines 158 to 165 in f0b4f9f

else:  # inference
    io = p.clone()  # inference output
    io[..., 0:2] = torch.sigmoid(io[..., 0:2]) + self.grid_xy  # xy
    io[..., 2:4] = torch.exp(io[..., 2:4]) * self.anchor_wh  # wh yolo method
    # io[..., 2:4] = ((torch.sigmoid(io[..., 2:4]) * 2) ** 3) * self.anchor_wh  # wh power method
    io[..., 4:] = torch.sigmoid(io[..., 4:])  # p_conf, p_cls
    # io[..., 5:] = F.softmax(io[..., 5:], dim=4)  # p_cls
    io[..., :4] *= self.stride
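For reference, a hedged sketch of the target encodings each loss variant implies (the wh_targets helper below is hypothetical, not a repo function): the 'yolo' loss regresses log-space anchor multiples, while the 'power' loss regresses sigmoid-space values that invert the ((sigmoid(x) * 2) ** 3) decode.

import torch

def wh_targets(gwh, anchor_wh, method='yolo'):
    # gwh, anchor_wh: ground-truth and anchor wh in grid units
    ratio = gwh / anchor_wh  # box size as a multiple of the anchor
    if method == 'yolo':
        return torch.log(ratio)  # pairs with MSE(pi[..., 2:4], twh)
    return ratio ** (1. / 3) / 2  # pairs with MSE(sigmoid(pi[..., 2:4]), twh)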

@sanazss

sanazss commented Jul 21, 2019

So all img values are changed to the default color you put in the letterbox function, color = 127.5, 127.5, 127.5, which rounds up to 128. Is that the logic behind it? Because there is nothing else that could change all values of img to 128 in the COCO dataset. What if one uses grayscale images, should these numbers change?

@glenn-jocher
Member

@sanazss this is grey padding.
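For context, a minimal letterbox sketch (assumes OpenCV; it mirrors the idea, not the repo's exact function): resize preserving aspect ratio, then pad the remainder with grey so the padded pixels land at ~128.

import cv2

def letterbox_sketch(img, new_shape=416, color=(128, 128, 128)):
    h, w = img.shape[:2]
    r = new_shape / max(h, w)  # scale so the longer side fits new_shape
    img = cv2.resize(img, (int(round(w * r)), int(round(h * r))))
    dh, dw = new_shape - img.shape[0], new_shape - img.shape[1]
    top, left = dh // 2, dw // 2  # split padding evenly on both sides
    return cv2.copyMakeBorder(img, top, dh - top, left, dw - left,
                              cv2.BORDER_CONSTANT, value=color)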

@bchugg

bchugg commented Nov 14, 2019

@glenn-jocher I believe in the newest implementation there are only two places where one needs to make this change (it doesn't seem like it needs to be done in build_targets). If this is right, it could be useful to update your summary to avoid confusion. If this is wrong, then I'd love to know where else to make the change. My loss currently diverges when using the yolo method, but P, R, mAP and F1 all remain zero when using the power method. Odd.

@glenn-jocher
Member

glenn-jocher commented Nov 14, 2019

@bchugg it's true that the darknet/'yolo' wh method is subject to divergence (as is clear from the plot above). The introduction of GIoU has mostly suppressed instances of this happening, though it still does happen on occasion, typically when the GIoU gain hyperparameter or the SGD LR is set too high.

You are correct also, the introduction of GIoU removed the w/h calculation from build_targets. I will update the 'TO SUMMARIZE' comment above!

In any case, to keep things simple for you, I would leave the wh method at the default and simply reduce your GIoU gain or SGD LR hyperparameters:

yolov3/train.py

Lines 23 to 30 in a96e010

# Hyperparameters (k-series, 57.7 mAP yolov3-spp-416) https://github.com/ultralytics/yolov3/issues/310
hyp = {'giou': 3.31,  # giou loss gain
       'cls': 42.4,  # cls loss gain
       'cls_pw': 1.0,  # cls BCELoss positive_weight
       'obj': 40.0,  # obj loss gain (*=img_size/320 * 1.1 if img_size > 320)
       'obj_pw': 1.0,  # obj BCELoss positive_weight
       'iou_t': 0.213,  # iou training threshold
       'lr0': 0.00261,  # initial learning rate (SGD=1E-3, Adam=9E-5)
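For example, a hedged sketch of dialing these back before training (values illustrative, not tuned):

hyp['giou'] *= 0.5  # halve the GIoU loss gain
hyp['lr0'] = 1E-3   # lower the initial SGD learning rate (from 0.00261)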

@glenn-jocher
Member

I've clamped the wh output to max=1E4 now to prevent wh divergence. This should resolve the issue completely.

pbox = torch.cat((pxy, torch.exp(ps[:, 2:4]).clamp(max=1E4) * anchor_vec[i]), 1) # predicted box
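A quick sketch of why the clamp matters: exp() of a large raw output overflows to inf in float32, which turns the loss into nan; the clamp caps the decoded wh at a finite value.

import torch

p = torch.tensor([100.0])
print(torch.exp(p))                 # tensor([inf]) -> nan losses downstream
print(torch.exp(p).clamp(max=1E4))  # tensor([10000.]) stays finite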

@ghost

ghost commented Jan 10, 2020

@glenn-jocher I was implementing YOLOv2 from scratch for face detection and encountered the same issue: the loss going nan purely because of the wh_loss term. So I came across this power method you have developed and started training with it. I observed that the exploding-gradients problem is gone now, but the model is not converging to an acceptable optimum; the loss stagnates around 15.xx after some 1000 epochs. So I have some questions which I hope you can help me with:

  1. While building the targets, do the w and h dimensions have to be log-transformed, or just left in the scale of the grid, by doing something like this:
     g_wh = g_wh / image_w

  2. The power method you have referred to, does it look something like this:
     wh_loss = clamp(exp(sigmoid(g_wh)), 1e4) * anchor_dimension
     If not, can you please correct the loss equation?

  3. During inference, the wh transformation looks something like this:
     p_wh = exp(wh) * anchor_dimension
     Is this way of inferencing correct?

Also, I am using MSE for the regression loss.

thanks.

@glenn-jocher
Member

@agarwalyogeesh the power method seen in #168 (comment) is in units of grid points. There are no log operations. The equations are in #168 (comment).

Note that the GIoU loss implementation seems to fix most of the unstable losses in the original exp wh method, and we now use this combination (GIoU loss with the original exp wh).
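For readers unfamiliar with GIoU, a hedged sketch for axis-aligned (x1, y1, x2, y2) boxes; the training loss is then 1 - GIoU:

import torch

def giou(box1, box2, eps=1e-9):
    # intersection of the two boxes
    iw = (torch.min(box1[2], box2[2]) - torch.max(box1[0], box2[0])).clamp(0)
    ih = (torch.min(box1[3], box2[3]) - torch.max(box1[1], box2[1])).clamp(0)
    inter = iw * ih
    union = ((box1[2] - box1[0]) * (box1[3] - box1[1]) +
             (box2[2] - box2[0]) * (box2[3] - box2[1]) - inter + eps)
    iou = inter / union
    # smallest enclosing box
    cw = torch.max(box1[2], box2[2]) - torch.min(box1[0], box2[0])
    ch = torch.max(box1[3], box2[3]) - torch.min(box1[1], box2[1])
    c_area = cw * ch + eps
    return iou - (c_area - union) / c_area  # GIoU = IoU - (C - U) / C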

@ghost

ghost commented Jan 11, 2020

@glenn-jocher, thanks for your reply.
I understand that using GIoU will eliminate the unstable-loss problem, but before implementing that I wanted to experiment with MSE for regression, and wanted to ask: what is k in the loss computation in this comment? #168 (comment)

Also, in the inference equation you mentioned above, ((sigmoid(x) * 2) ** 3), the range varies over [0-8], right? How is this range compatible with the range used in the loss function, given k = 1 in the loss computation for the power method?

The model's failure to converge is what worries me; maybe it's due to too little data, or maybe I need more training time.

thanks,

@glenn-jocher
Member

@agarwalyogeesh k is a hyperparameter; it's tunable. The output range can vary from 0 to any number; in our case we set it to 8.

@github-actions

github-actions bot commented Mar 8, 2020

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale (Stale and scheduled for closing soon) label Mar 8, 2020
@meet-minimalist

(quoting glenn-jocher's 'TO SUMMARIZE' comment above)
Thanks for the wonderful explanation and for the idea to solve the training instability.

I want to point out that the output range of the power wh method will be [0-8], and that will be multiplied by the anchor width/height.

But this forces the prediction to be greater than or equal to the anchor size, not smaller. Our prediction should handle boxes both smaller and larger than the anchor, so with the current power wh method it is forced to predict only boxes larger than (or equal to) the anchor.

So I suggest we use the tanh function, which has a range of [-1, +1]. We then multiply it by 2 to make the range [-2, +2]. After we square it / cube it, we may get the output range [-4, +4] / [-8, +8].

This may solve that issue and probably provide better results.

@glenn-jocher
Member

glenn-jocher commented Apr 15, 2020

@meet-minimalist yes, tanh is very similar to sigmoid, and it outputs from -1 to 1.

But we are outputting a multiple of the anchor, which currently ranges from 0 to inf. The exp(x) method has no upper limit on x, causing the instability, but we need an output floor of 0 in all cases.

The proposed change is (sigmoid(x) * 2) ** 3, which ranges from 0-8, and retains the same centerpoint f(0)=1 as exp(x).
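A quick numeric check of that point (sketch, not repo code): a tanh-based multiplier goes negative for negative inputs, which is invalid for a wh scale, while the sigmoid form keeps the required floor of 0.

import torch

x = torch.tensor([-2.0, 0.0, 2.0])
print((torch.tanh(x) * 2) ** 3)     # negative for x < 0 -> invalid wh multiple
print((torch.sigmoid(x) * 2) ** 3)  # always > 0, and exactly 1 at x = 0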

@meet-minimalist

(quoting glenn-jocher's reply above)

Yeah, you are correct. My bad; I forgot this is a multiplier, which needs to be positive in any case.
For a box smaller than the anchor, the prediction value will be in [0-1], and for a box larger than the anchor, it will be in [1-8].

Thanks again.

@glenn-jocher
Member

glenn-jocher commented Aug 29, 2021

YOLOv3 vs YOLOv5 wh method plotting code:

import numpy as np
import torch
from matplotlib import pyplot as plt


def plot_wh_methods():  # from utils.plots import *; plot_wh_methods()
    # Compares the two methods for width-height anchor multiplication
    # https://github.com/ultralytics/yolov3/issues/168
    x = np.arange(-4.0, 4.0, .1)
    ya = np.exp(x)
    yb = torch.sigmoid(torch.from_numpy(x)).numpy() * 2

    fig = plt.figure(figsize=(6, 3), tight_layout=True)
    plt.plot(x, ya, '.-', label='YOLOv3')
    plt.plot(x, yb ** 2, '.-', label='YOLOv5 ^2')
    plt.plot(x, yb ** 1.6, '.-', label='YOLOv5 ^1.6')
    plt.xlim(left=-4, right=4)
    plt.ylim(bottom=0, top=6)
    plt.xlabel('input')
    plt.ylabel('output')
    plt.grid()
    plt.legend()
    fig.savefig('comparison.png', dpi=200)

[Figure: exp(x) 'YOLOv3' curve vs. (sigmoid(x) * 2)^2 and ^1.6 'YOLOv5' curves]

@glenn-jocher glenn-jocher added the documentation Improvements or additions to documentation label Aug 29, 2021
@jaskiratsingh2000

(quoting glenn-jocher's 'TO SUMMARIZE' comment above)
@glenn-jocher These changes have to be made before training, and once the swap is done we then train on our custom data, right?

@glenn-jocher
Member

@jaskiratsingh2000 this issue is simply explaining the existing updates to our regression equations; no modifications need to be made in the code, as the updates are already implemented.
