
[question] What is the purpose of the variances? #53

Closed
oarriaga opened this issue Mar 10, 2017 · 11 comments

Comments

@oarriaga
Contributor

commented Mar 10, 2017

Hello @rykov8 sorry for bothering you again,
Looking at your ssd_training.py file, it seems that the variances do not take part in the training loss, or in any other part of the training pipeline. However, in your ssd_utils.py, the method detection_out does change the predictions by multiplying them by their respective variances.

        decode_bbox_center_x = mbox_loc[:, 0] * prior_width * variances[:, 0]
        decode_bbox_center_x += prior_center_x
        decode_bbox_center_y = mbox_loc[:, 1] * prior_height * variances[:, 1]
        decode_bbox_center_y += prior_center_y
        decode_bbox_width = np.exp(mbox_loc[:, 2] * variances[:, 2])
        decode_bbox_width *= prior_width
        decode_bbox_height = np.exp(mbox_loc[:, 3] * variances[:, 3])
        decode_bbox_height *= prior_height

I do understand that one has to decode the boxes, since they were encoded using the transformation described in equation 2 of the SSD paper (and in Faster R-CNN).

My main concern is that the variances explicitly change the values output by the CNN without being considered directly in the training procedure; furthermore, I cannot find any reference to these variances in either the SSD or the Faster R-CNN paper. Maybe I am missing something in the papers or in the implementation; in that case I would be very grateful if you could tell me whether I am making a mistake, or elaborate on the use of these variances.
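To make the concern concrete, here is a minimal round-trip sketch (the prior, ground-truth box, and single-box setup are made up for illustration; [0.1, 0.1, 0.2, 0.2] are the PriorBox defaults). As long as the same variances are used in encode_box and detection_out, they cancel exactly:

```python
import numpy as np

# Hypothetical single prior and ground-truth box in (cx, cy, w, h) form.
prior = np.array([0.5, 0.5, 0.2, 0.2])
gt = np.array([0.55, 0.48, 0.25, 0.18])
variances = np.array([0.1, 0.1, 0.2, 0.2])  # PriorBox defaults

# Encode (SSD eq. 2, plus the extra division by the variances):
enc = np.empty(4)
enc[:2] = (gt[:2] - prior[:2]) / prior[2:] / variances[:2]
enc[2:] = np.log(gt[2:] / prior[2:]) / variances[2:]

# Decode (what detection_out does, in reverse):
dec = np.empty(4)
dec[:2] = enc[:2] * variances[:2] * prior[2:] + prior[:2]
dec[2:] = np.exp(enc[2:] * variances[2:]) * prior[2:]

assert np.allclose(dec, gt)  # the variances cancel between encode and decode
```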

Thank you very much.

@tachim

commented Mar 10, 2017

They are taken into account in encode_box, which is called by assign_boxes, which is used to create the labeled data for training.

Edit: pretty sure they're used for numerical conditioning of the problem; otherwise the detection offsets would be on a different scale than the classifications and would slow down the optimization.
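A quick numeric illustration of that conditioning argument (all numbers below are made up, just simulating matched prior/ground-truth pairs): without the division, the regression targets sit well below unit scale, and dividing by the 0.1 "variance" brings them back to roughly unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matched pairs: the center of a matched ground-truth box
# jitters by ~0.02 around a prior of width 0.2.
prior_width = 0.2
center_offsets = rng.normal(0.0, 0.02, 10000)

raw_cx = center_offsets / prior_width  # encoded target without variances
scaled_cx = raw_cx / 0.1               # after dividing by the 0.1 "variance"

print(raw_cx.std())     # ~0.1: far below unit scale
print(scaled_cx.std())  # ~1.0: roughly unit variance
```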

@oarriaga

Contributor Author

commented Mar 11, 2017

@tachim thank you very much! I did not see that they were being used in the encode_box method. Then everything makes sense and, as you said, they are probably used for numerical conditioning.

@oarriaga oarriaga closed this Mar 11, 2017

@oarriaga

Contributor Author

commented May 23, 2017

Hello @villanuevab
As far as I can tell, they scale the transformed box coordinates to make the learning task easier. They are not mentioned in the paper.

@villanuevab

commented May 23, 2017

Please see the inline comments:

# the following 2 lines => g_hat for cx, cy
encoded_box[:, :2][assign_mask] = box_center - assigned_priors_center
encoded_box[:, :2][assign_mask] /= assigned_priors_wh
# here we divide by the "variance" of cx, cy of the prior boxes
encoded_box[:, :2][assign_mask] /= assigned_priors[:, -4:-2]
# the following line => g_hat for wh
encoded_box[:, 2:4][assign_mask] = np.log(box_wh /
                                          assigned_priors_wh)
# here we divide by the "variance" of wh of the prior boxes
encoded_box[:, 2:4][assign_mask] /= assigned_priors[:, -2:]

May I ask: What are these "variances" on a conceptual level and why are they used? Why not just use the \hat{g} definitions? Thanks to this thread, I understand that they are incorporated into the training pipeline, but on a conceptual level I cannot match these variances with:

  1. any concept outlined in the paper,
  2. any mathematical definition of variance that I know of.

How are these variances initialized? I.e., why are they set to [0.1] as a default in PriorBox?

I'd really appreciate some clarification here! Thank you.
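For reference, the encode_box lines above implement equation 2 of the SSD paper with an extra division by the variances (writing the prior as d, the ground truth as g, and the variances as \sigma):

    \hat{g}_{cx} = (g_{cx} - d_{cx}) / (d_w \sigma_{cx})
    \hat{g}_{cy} = (g_{cy} - d_{cy}) / (d_h \sigma_{cy})
    \hat{g}_{w}  = \log(g_w / d_w) / \sigma_w
    \hat{g}_{h}  = \log(g_h / d_h) / \sigma_h

with (\sigma_{cx}, \sigma_{cy}, \sigma_w, \sigma_h) = (0.1, 0.1, 0.2, 0.2) as the PriorBox defaults; setting all \sigma = 1 recovers the plain \hat{g} definitions from the paper.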

@villanuevab

commented May 23, 2017

@oarriaga I edited my question to be more precise; I think you addressed my confusion on why they are included in the code, even if they are not in the paper. Thank you!

Can you elaborate on why they are called "variances" and how they are initialized?

@oarriaga

Contributor Author

commented May 23, 2017

I believe the term variances is misleading; they should be called something like "box_scale_factors", or at least that's what I call them in my SSD implementation.

@rykov8

Owner

commented May 23, 2017

@villanuevab here the author of the original paper comments on the variances. The naming probably comes from the idea that ground-truth bounding boxes are not always precise; in other words, they vary from image to image, probably even for the same object in the same position, just because human labellers cannot repeat themselves perfectly. Thus, the encoded values are effectively random values, and we want them to have unit variance, which is why we divide by some value. Why they are initialized to the values used in the code, I have no idea; probably some empirical estimation by the authors.
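That unit-variance reading can be sketched numerically (the Gaussian targets below are made up; this is not the authors' actual estimation procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical encoded center offsets (the values before any variance
# scaling) collected over a labeled training set.
g_hat_cx = rng.normal(0.0, 0.1, 5000)

# The "variance" could be estimated empirically as the standard deviation
# of those targets ...
estimated_sigma = g_hat_cx.std()

# ... and dividing by it whitens the targets to unit variance:
whitened = g_hat_cx / estimated_sigma
print(whitened.std())  # 1.0 up to floating-point error
```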

@MicBA

commented Jun 8, 2017

Hi @rykov8,
when I evaluate the model and look at the numbers the network returns, I can see that the variance is always the same. Is that OK? If it is supposed to be the same, why not store it as a default parameter rather than as a network output?

@stavBodik

commented Nov 29, 2017

When I am testing the assign_boxes function, I get negative coordinates for the assigned bounding boxes, which is caused by the encoding applied after the IOU step inside the encode_box function.

The variances are coefficients for encoding/decoding the locations of bounding boxes: the first values are used to encode/decode the coordinates of the centers, and the last values are used to encode/decode the sizes of the bounding boxes.

Why is the encoding/decoding needed at all? Is it to use fewer parameters during training (2 instead of 4)? For faster optimization? Fewer calculations?
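Regarding the negative values: negative encoded coordinates are expected, since they are offsets relative to the prior rather than absolute coordinates. As for the log/exp encoding of the sizes, one common motivation is that it keeps the decoded width and height strictly positive no matter what real number the network outputs, and expresses them relative to the prior so the targets are scale-invariant. A small sketch (the raw outputs here are made up):

```python
import numpy as np

# Hypothetical raw network outputs for the width channel; they can be any
# real number, including negative values.
mbox_loc_w = np.array([-3.0, 0.0, 3.0])
prior_width = 0.2
variance_w = 0.2  # repo default for the size channels

# Decoding with exp guarantees a strictly positive width; a raw output of
# zero decodes back to the prior's own width.
decoded_w = np.exp(mbox_loc_w * variance_w) * prior_width
print(decoded_w)  # all strictly positive
```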

@leimao

commented Apr 9, 2019

In my opinion, most implementations of bounding box encoding/decoding with variances are conceptually incorrect. Please check my post at https://leimao.github.io/blog/Bounding-Box-Encoding-Decoding/ and let me know what you think. Thank you.

@mattroos

commented May 23, 2019

Thank you, @leimao. Very helpful blog post!
